Bytes & Drafts

Working notes on code, hardware, and the people around them

Engineering, Philosophy, Computing History, Distributed Systems, Real-Time Systems

Software is when hardware changes, Part 2: When the clock disappears

22 min read
In this article
  1. When Clocks Lie
  2. The Hardware is Already Async
  3. Readiness as the New Contract
  4. What History Already Taught Us
  5. A Practical Checklist
  6. Where This Shows Up
  7. Testing Without Timestamps
  8. Why This Actually Matters
  9. The Takeaway
  10. References

If you have ever chased a bug where a system froze every few hours, only to wake up the moment you attached a debugger, you already know that time is rarely as simple as a counter that increments once a millisecond. Timers fire slightly early or slightly late, interrupts sneak in between instructions, peripherals run from their own oscillators, and distributed nodes see each other through links with variable delay. From the outside everything looks nicely clocked. Inside, every part of the stack negotiates its own idea of now.

In Part 1, "Software is when hardware changes", I argued that software is not an abstract thing that lives in source files. It only exists while some physical device is changing state in a particular order. A very sharp way to test that idea is to imagine taking the clock away. What does a program even mean if there is no metronome that tells every component when to move?

Real systems rarely reach that pure, clockless extreme, of course. What does happen in practice is that the neat, single global clock story starts to fray. Modern chips already juggle many clock domains that talk through synchronizers, wireless sensor nodes spend most of their time asleep and only wake when some external stimulus arrives, and distributed services coordinate through loosely synchronized clocks that are always a little wrong. In other words, lots of software already runs in environments where the clock is unreliable or absent as a coordination primitive.

This post is about what that does to our mental models. It is about what changes when you stop thinking in terms of a shared tick and start thinking in terms of readiness, dependency, and causality. That sounds high level, but it shows up in very concrete places: the layout of a PCB, the contract for an API, the shape of tests, and the way you write a retry loop.

When Clocks Lie

The traditional synchronous story feels wonderfully comforting. Somewhere on the board a crystal oscillates with impressive stability, the chip distributes that signal along a carefully tuned tree, and on each rising edge every register samples its inputs. We get to pretend that the entire system advances in discrete, globally shared steps, one cycle after another. At the design level, it is almost irresistible to annotate events with cycle numbers and treat those numbers as ground truth.

Physics only lets that illusion survive for a while. Signals propagate with finite speed and nonzero skew, supply voltages sag when too much logic switches at once, transistors age and drift, and there is always some region of the chip where the clock edge arrives a little earlier or later than you expected. The clock distribution network itself can consume a noticeable fraction of the total power budget, and it becomes harder to keep under control as frequencies and die sizes rise.1 Synchronous design works by accepting all of that complexity, then spending buffers, guard bands, and verification effort to hide it.

If you just live at the software level, the clock continues that act of concealment. You read what looks like a steadily increasing counter through something like clock_gettime or Date.now(), and you build schedulers, timeouts, and backoff strategies that assume the value you receive is a decent proxy for real time. It usually is. Except when it is not.

That assumption leaks directly into code like the toy wait loop below:

// Toy wait loop. peerReady() and TimeoutError are hypothetical helpers,
// assumed to exist elsewhere in the codebase.
declare function peerReady(): Promise<boolean>;
declare class TimeoutError extends Error {}

async function waitForPeer(now = () => Date.now()): Promise<void> {
  const deadline = now() + 200; // trusts the clock to measure out 200 ms

  // Busy-polls: no sleep, no readiness signal, just repeated clock reads.
  while (now() < deadline) {
    if (await peerReady()) return;
  }

  throw new TimeoutError();
}

If the process is descheduled and the clock leaps forward when it resumes, the loop declares a timeout after only a handful of attempts, even though the peer may have been about to reply. If the clock jumps backward because the hypervisor corrected drift, the loop spins far past the intended 200 ms, burning CPU the whole time. Nothing in the logic asked "has the dependency become ready?" It only asked "has 200 ms expired yet?" and trusted the clock to answer perfectly.
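
For contrast, here is a sketch of the same wait written around an explicit readiness signal; waitForPeerEvent is an invented helper that resolves when the peer announces readiness, and the timer survives only as a guard against a peer that never answers:

declare function waitForPeerEvent(): Promise<void>; // invented readiness signal

async function waitForPeerReady(timeoutMs = 200): Promise<void> {
  let guard: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    guard = setTimeout(() => reject(new TimeoutError()), timeoutMs);
  });

  try {
    // Wakes up when the peer is ready, not on a schedule of clock reads.
    await Promise.race([waitForPeerEvent(), timedOut]);
  } finally {
    if (guard !== undefined) clearTimeout(guard);
  }
}

The clock is still there, but it now only bounds how long we are willing to wait; it no longer decides when the peer becomes interesting.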

The gap between apparent time and physical reality shows up any time software treats the clock as a strong ordering signal instead of a useful hint. In June 2012, a leap second exposed bugs in some Linux kernels that left servers stuck with near 100 percent CPU usage as kernel timers looped on inconsistent values.2 Services that had assumed a smooth, strictly increasing time source suddenly hit behavior that looked like a denial of service attack from their own infrastructure. In January 2017, Cloudflare wrote about an incident where parts of their DNS infrastructure failed after the New Year's leap second made time appear to run backward, tripping code that assumed elapsed time could never be negative.3

These were not outages caused by a lack of time. They were outages caused by code that relied on a convenient abstraction of time that stopped matching what the underlying clocks were doing. The clock lied, or at least spoke with enough ambiguity that software could not interpret it consistently.

The Hardware is Already Async

It is tempting to treat asynchronous hardware as some exotic niche, strictly separate from the synchronous systems most of us program against. The reality is that the hardware under our feet has been quietly drifting away from a single shared clock for decades.

On a modern system on chip, various blocks run at different frequencies or even from different sources. A memory controller might run at one rate, a peripheral bus at another, and low power domains at yet another. Data crosses those boundaries through clock domain crossing logic such as asynchronous FIFOs and synchronizers that trade throughput or latency for metastability safety.4 Each domain is synchronous locally, but the boundaries are fully event driven. They are already places where no global cycle count exists.

Go up a level. Microcontrollers aimed at battery powered devices commonly expose several sleep modes where the main CPU clock stops, yet peripherals continue to run from independent oscillators. A real time clock can wake the system from deep sleep, or a pin change can trigger an interrupt while the core is halted.5 To software that insists on polling with busy loops, the world stops whenever the core sleeps. To software that is written in terms of events and readiness, the system has not stopped at all; it has simply changed which parts are allowed to move.

Take one more step out to distributed systems and the idea of a single, precise clock falls apart completely. Even in fiber, light only travels on the order of tens of centimeters per nanosecond, so clocks that are nanosecond accurate at different data centers would need to account for that propagation delay plus all the messiness of real networks.6 In practice, systems synchronize clocks using protocols like NTP or PTP, accept that each node has a bounded error, and design algorithms that assume clock readings are uncertain. The Spanner database, for instance, wraps clock uncertainty into a TrueTime API and forces transactions to respect those uncertainty bounds so that time based ordering remains safe even when underlying clocks drift or jump.7
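
To make that concrete, here is a minimal sketch of an uncertainty-aware timestamp in the spirit of TrueTime; the names and the fixed error bound are invented for the example, not Spanner's actual API:

// Simplified sketch of an uncertainty-aware clock. The fixed error bound and
// every name here are illustrative assumptions, not Spanner's real interface.
interface TimeInterval {
  earliestMs: number; // the true time cannot be earlier than this
  latestMs: number;   // ...and cannot be later than this
}

function nowWithUncertainty(errorBoundMs = 5): TimeInterval {
  const reading = Date.now();
  return { earliestMs: reading - errorBoundMs, latestMs: reading + errorBoundMs };
}

// "Commit wait" in miniature: do not act on a timestamp until every clock
// within the error bound must agree that it lies in the past.
async function waitUntilDefinitelyPast(timestampMs: number): Promise<void> {
  while (nowWithUncertainty().earliestMs <= timestampMs) {
    await new Promise<void>((resolve) => setTimeout(resolve, 1));
  }
}

The interesting move is that uncertainty is part of the return value, so callers cannot accidentally pretend it does not exist.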

Seen in that light, the idea of a single, global, perfectly trustworthy clock looks more like a useful training wheel. The deeper you go into the hardware, and the wider you go across networks, the more you find that components behave as little event driven islands that occasionally exchange messages. Time remains vitally important, but mostly as a relation that orders events, not as a scalar that every part of the system agrees on.

Readiness as the New Contract

If you take the global clock away as a coordination primitive, you have to replace it with something. The shape of that something depends on the level you care about, but the pattern repeats. You stop saying "do this at time T" and start saying "do this when condition C becomes true".

At the circuit level, asynchronous designers structure logic around handshakes. A block receives a request signal along with data, performs some work, then raises a response signal when the output is stable. Bundled data protocols combine a few wires of control with a bus of data and rely on relative timing; more delay insensitive variants encode extra information into the data path so that the receiver can detect completion without assuming a particular delay.8 In both cases, the key idea is that data validity is explicit. The block that consumes a value does not trust a clock edge. It trusts a contract that says when the producer has finished.

Software has a very similar shift available. Instead of sprinkling sleeps and timestamp comparisons through code, you can design around readiness. Network code that uses select, epoll, or async runtimes already follows this pattern: wake up when a file descriptor becomes readable or writable, then push work forward until you would have to wait again. Message queues and actor systems describe behavior in terms of messages that arrive, not time that passes. Even batch processing jobs can often be phrased as "process input X when this file appears" rather than "run every N minutes".

This change in contract is subtle but powerful. If you write a retry loop as "try every 50 milliseconds until it works", you have tied behavior to a specific time source. If instead you write "try when a backoff token tells you it is safe", you can implement that token using real time, using a logical clock, or using some piece of domain specific readiness such as "new data arrived from upstream". The calling code does not care which one you picked. It only cares that the condition captures when the world has become ready for another attempt.
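
A small sketch of that contract swap might look like the following; BackoffGate and both implementations are invented names, not an existing library API:

// Hypothetical contract: callers only know that ready() resolves when another
// attempt is acceptable, not what "acceptable" is measured with.
interface BackoffGate {
  ready(): Promise<void>;
}

// One implementation spaces attempts out in wall clock time...
class TimedBackoff implements BackoffGate {
  constructor(private delayMs: number) {}

  ready(): Promise<void> {
    return new Promise<void>((resolve) => setTimeout(resolve, this.delayMs));
  }
}

// ...another waits for a domain specific readiness signal instead.
class EventBackoff implements BackoffGate {
  constructor(private upstreamIdle: () => Promise<void>) {}

  ready(): Promise<void> {
    return this.upstreamIdle();
  }
}

The retry loop only ever awaits ready(), so swapping the timed policy for an event driven one never touches the calling code.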

A readiness-first loop in a real service often ends up looking closer to an event dispatcher than to a sleep/poll cycle:

while (true) {
  const event = await nextEvent(); // resolves only when something is ready

  if (event.type === 'job') {
    await process(event.job);
  } else if (event.type === 'sensor') {
    await handleSensorEvent();
  } else if (event.type === 'backoff') {
    await tryReconnect();
    event.release(); // spacing between attempts
  }
}

Every branch is phrased as "do work when X is ready" instead of "do work after N ms". nextEvent() can internally race promises, listen to interrupts, or wait on logical counters, and the calling loop does not need to know which one implements the contract today.

Designing this way tends to flush out assumptions that are otherwise hard to see. You notice where you really need timeouts because a peer might never respond. You notice where you can naturally treat progress as a function of data flow instead of wall time. Most usefully, you gain explicit places to attach information about uncertainty and failure. A readiness condition can carry a reason, not just a Boolean flag.

What History Already Taught Us

None of this is new. Hardware and distributed systems researchers have been wrestling with life beyond the simple clock for a long time, and some of their work reads today like a field guide for the "no reliable clock" world.

The transputer, designed in the 1980s by David May and colleagues, aimed to make communicating processes a first class abstraction in hardware.9 The chip integrated fast links between small processor cores and built on the same ideas that C. A. R. Hoare described in Communicating Sequential Processes: explicit channels, synchronization on message passing, and a focus on causality over shared state.10 While the transputer itself did not take over the world, its influence shows up in modern message passing concurrency libraries and in the way some network on chip fabrics are structured.

Asynchronous circuit designers went further and asked what it would look like to build whole processors without global clocks. Work by Alain Martin and others explored self timed pipelines where each stage proceeds when the previous one finishes and the next one is ready, which produced designs that can save power and adapt their speed naturally to variations in voltage and temperature.11 Later, the SpiNNaker project built a large scale neuromorphic system with thousands of cores that communicate using an event driven interconnect; the emphasis there was on robustness and efficiency in the face of massive concurrency, not on any particular global frequency.12

Distributed systems theory has its own deep well of insight about life without a shared clock. Lamport's classic paper on time and ordering in distributed systems shows how you can reason about causality using logical clocks that increment on events and piggyback on messages, without any need for synchronized physical clocks.13 That work underlies a lot of modern thinking about vector clocks, version vectors, and conflict free replicated data types. If you have ever debugged a weirdly interleaved log from a cluster and wished every line carried a partial order instead of a raw timestamp, you are feeling exactly the same pressure that motivated those ideas.
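
The core mechanism fits in a few lines; this is a minimal sketch with invented names rather than anything tied to a specific library:

// Minimal Lamport clock: the counter advances on every local event and is
// pushed forward by timestamps that arrive on messages.
class LamportClock {
  private counter = 0;

  // Call for each local event; returns that event's logical timestamp.
  tick(): number {
    this.counter += 1;
    return this.counter;
  }

  // Call when a message arrives carrying the sender's timestamp.
  receive(remoteTimestamp: number): number {
    this.counter = Math.max(this.counter, remoteTimestamp) + 1;
    return this.counter;
  }
}

Timestamps produced this way respect causality: if one event could have influenced another, its number is smaller, and no physical clock ever enters the picture.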

When I read these histories as a software engineer, I mostly take away humility. People have been building real, robust systems that run in awkward timing environments for a very long time. Many of the patterns that we rediscover at the application level, such as idempotent operations, explicit acknowledgement, and causal ordering, were already invented in other layers where the lack of a reliable clock was unavoidable.

A Practical Checklist

If all of this stayed at the level of elegant theory, it would be interesting but not hugely helpful during a code review. So here is a more grounded set of questions I like to ask when working on systems that either do not have a trustworthy clock or should not depend on one as much as they currently do.

First, which parts of the design really need wall clock time and which parts only need ordering or rate limiting? For example, user facing expiry often must be tied to real time. A password reset email that stays valid for hours instead of minutes is a security problem. On the other hand, retry loops inside an internal service are usually more concerned with not hammering a dependency than with aligning to real minutes or seconds. Those loops can often be driven by a logical backoff counter or by readiness signals from the dependency instead of by strict timers.

Second, where are you using timestamps as uniqueness tokens or as encodings of identity? This is one of the easiest ways to get into trouble when clocks drift or jump. Several real incidents have come from systems that generated identifiers using coarse timestamps, then saw collisions or reordering when clocks were adjusted.14 Using random or structured identifiers that do not carry time information avoids those failure modes entirely. If you want time for debugging, attach it separately as an annotation.
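
As a trivial sketch of that split (JobRecord and newJobRecord are invented names for the example):

import { randomUUID } from 'node:crypto';

// Identity comes from randomness; the timestamp is only an annotation for
// humans and dashboards, never an input to ordering or uniqueness.
interface JobRecord {
  id: string;
  createdAtMs: number;
}

function newJobRecord(): JobRecord {
  return { id: randomUUID(), createdAtMs: Date.now() };
}

If two records ever collide now, the bug is in the random number generator, not in whatever the clock was doing that day.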

Third, do your protocols and data structures express causality directly, or do they rely on everyone sharing the same notion of before and after? For instance, if you have two replicas processing updates, do you model conflicts in terms of "this happened after that" using explicit version vectors or logical clocks, or do you simply compare timestamps and pick the larger one? The second option is simpler and looks fine until a leap second or a misconfigured NTP client makes some timestamps jump backward.2 The first option takes more thought but tends to fail more predictably.
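
A bare-bones version of the first option looks something like this; the names are invented, but the comparison rule is the standard one for version vectors:

// Version vector: one counter per replica. a "happened before" b when every
// entry in a is <= the matching entry in b and at least one is strictly smaller.
type VersionVector = Record<string, number>;

function happenedBefore(a: VersionVector, b: VersionVector): boolean {
  const replicas = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlySmaller = false;

  for (const replica of replicas) {
    const av = a[replica] ?? 0;
    const bv = b[replica] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlySmaller = true;
  }

  return strictlySmaller;
}

// Neither order holds: the updates are concurrent and need real conflict
// resolution instead of "pick the larger timestamp".
function isConcurrent(a: VersionVector, b: VersionVector): boolean {
  return !happenedBefore(a, b) && !happenedBefore(b, a);
}

The failure mode changes shape: instead of silently losing an update when a clock jumps, you get an explicit concurrent case that you have to decide how to merge.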

Fourth, can the system make forward progress when the clock is wrong, slow, or slightly frozen? The 2012 leap second issues were extra painful because some components got stuck in busy loops that assumed time would move smoothly forward. A useful design question is: "What does this component do if time stops?" A design that continues to process queued work, or that times out based on the amount of data processed rather than the wall clock, is often more robust than one that treats time as a heartbeat.
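
One concrete shape for that attitude, sketched with invented names, is to bound each run by the amount of work attempted rather than by a wall clock deadline:

// Progress-based guard: the run ends after a bounded amount of attempted work,
// so a frozen or jumping clock can neither wedge it nor cut it short.
async function drainQueue<Item>(
  next: () => Promise<Item | null>,   // resolves to null when the queue is empty
  handle: (item: Item) => Promise<void>,
  maxItemsPerRun = 1_000,
): Promise<void> {
  for (let processed = 0; processed < maxItemsPerRun; processed++) {
    const item = await next();
    if (item === null) return;        // nothing left; pick up again next run
    await handle(item);
  }
}

If time stops, this loop neither wedges nor gives up early; it simply does its bounded chunk of work and comes back for more.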

Finally, do your tests encourage healthy or unhealthy attitudes toward time? If all your integration tests spin on real time sleeps, it becomes very easy to write production code that does the same. Tests that use simulated time, or that drive systems with explicit events instead of timers, tend to push designs toward clearer readiness contracts.

None of these questions are silver bullets. They mostly serve as prompts that make hidden dependencies on the clock visible, which is the first step toward removing or at least documenting them.

Where This Shows Up

Once you start looking for places where software runs with a weak or missing clock, it becomes hard to stop seeing them.

Embedded development is full of them. Small sensors that harvest energy from the environment may only have enough charge to run for a short burst whenever a capacitor crosses some threshold.15 Instead of a regular tick, the entire computation is gated by the arrival of energy. Data must be stored in ways that survive brownouts, and work must be broken into idempotent chunks that can resume after unexpected pauses. In that world, time as an ordering relation is tightly coupled to the physics of the harvester, not to the tick of a quartz crystal.

On the server side, highly virtualized environments can present surprisingly poor clocks. Virtual machines may be paused, migrated, or oversubscribed in ways that make perceived time pass irregularly. If you are unlucky, you may even see clocks jump backward when a hypervisor corrects drift. That is one reason large operators invest heavily in robust time distribution and monitoring, and publish detailed descriptions of how to design for clock uncertainty.14,7 Application code that treats time as slightly fuzzy tends to behave much better than code that assumes a perfect counter.

Even apparently simple user interfaces can have a weaker relationship with time than you might think. Debounced inputs, progressive backoff in search boxes, and delayed validation of forms are often framed as "wait N milliseconds before doing X". Under the hood, these features are usually more robust when expressed in terms of readiness conditions: "act when the input has been stable for some minimum interval" or "refresh when the previous request has either completed or clearly stalled". The point is not to avoid time entirely. The point is to describe behavior in terms of what has happened, not in terms of what some clock should have done.
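
The first phrasing translates almost directly into code; this sketch, with invented names, still uses a timer underneath but exposes a contract about stability rather than about elapsed milliseconds:

// Calls back only once the value has stopped changing for quietMs.
// The exposed contract is "the input has been stable", not "N ms since keypress".
function onStable<T>(quietMs: number, callback: (value: T) => void): (value: T) => void {
  let pending: ReturnType<typeof setTimeout> | undefined;

  return (value: T) => {
    if (pending !== undefined) clearTimeout(pending);
    pending = setTimeout(() => callback(value), quietMs);
  };
}

The caller reasons about stability; the setTimeout underneath is an implementation detail that could be swapped for any other notion of quiet.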

Finally, distributed tracing and logging infrastructure lives directly in the tension between wall time and causality. Trace visualization tools work much better when spans carry explicit parent child relationships rather than relying purely on timestamps to infer order. As anyone who has tried to align logs from two data centers knows, a neat timeline based on timestamps can be wildly misleading when clocks differ by even a few milliseconds. Expressing relationships explicitly often matters more than shaving a little error off the time sync algorithm.

Testing Without Timestamps

If you accept that real clocks are messy, the obvious next question is how to test systems that deal with time. It is tempting to give up and say "real time is the only thing that matters", but that route leads directly to tests that are slow, flaky, and hard to interpret.

One option is to introduce explicit time sources as dependencies instead of calling system APIs directly. In many languages you can define an interface that exposes "now" and scheduling operations, then implement that interface with both a real clock and a simulated one. Unit tests can drive the simulated clock forward in controlled ways, which lets you exercise edge cases such as large jumps, pauses, or rapid sequences of events without waiting in real time. Some runtimes and testing libraries even provide virtual time facilities for this purpose.

// Retry is a hypothetical class whose backoff is scheduled with setTimeout.
it('retries even when time jumps', () => {
  vi.useFakeTimers();                 // or jest.useFakeTimers()
  const retry = new Retry();

  retry.failOnce();                   // schedules the first backoff timer
  vi.advanceTimersByTime(5_000);      // simulate a paused VM resuming later

  expect(retry.readyToRetry()).toBe(true);
  vi.useRealTimers();                 // restore real timers for later tests
});

The production code only depends on setTimeout/setInterval, while the test drives Vitest/Jest fake timers forward in explicit steps. That makes it trivial to reproduce scenarios where time jumps, stalls, or advances faster than real wall time without waiting for any of those things to happen on their own. The important part is that the test, not the wall clock, decides how and when time moves.
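
The other route mentioned earlier, injecting an explicit time source, can be sketched like this; the Clock interface and both implementations are illustrative rather than taken from any particular library:

// Explicit time source as a dependency: production wires in the real clock,
// tests wire in a manual one they can move however they like.
interface Clock {
  now(): number;
  schedule(delayMs: number, task: () => void): void;
}

const systemClock: Clock = {
  now: () => Date.now(),
  schedule: (delayMs, task) => { setTimeout(task, delayMs); },
};

class ManualClock implements Clock {
  private current = 0;
  private pending: Array<{ due: number; task: () => void }> = [];

  now(): number { return this.current; }

  schedule(delayMs: number, task: () => void): void {
    this.pending.push({ due: this.current + delayMs, task });
  }

  // Tests call this to move time forward (or leave it frozen) on purpose.
  advance(byMs: number): void {
    this.current += byMs;
    const due = this.pending.filter((p) => p.due <= this.current);
    this.pending = this.pending.filter((p) => p.due > this.current);
    due.forEach((p) => p.task());
  }
}

Production code takes a Clock and never notices the difference; tests construct a ManualClock and call advance exactly when they want time to move, including not at all.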

Another useful technique is to structure tests around observable events rather than elapsed time. For example, instead of asserting that "after 200 milliseconds the retry counter equals 3", you can assert that "after three failed calls, the system waits at least once before trying again". The former depends heavily on the exact timing parameters and on the scheduling behavior of the runtime. The latter describes the causal structure you care about. It is usually more stable under refactoring, and it continues to hold even if the implementation changes from time based backoff to a readiness based scheme.

In integration and system tests, it often helps to treat the clock as just another unreliable dependency. You can deliberately skew clocks between nodes, inject leap seconds, or simulate time sources that jump backward or forward within reasonable bounds. Real operators do this at scale. Cloudflare, for instance, described exercises where they forced parts of their infrastructure to experience unusual time behavior so they could observe which components coped well and which ones did not.3 Borrowing a smaller version of that attitude for test environments makes time related bugs surface earlier, when they are cheaper to fix.

The common pattern in all of these techniques is that tests talk about ordering, causality, and readiness more than about specific wall clock values. When tests are written that way, they naturally push production code in the same direction.

Why This Actually Matters

All of this can sound a bit philosophical, especially if most of your day job involves writing request handlers and database queries. The clock on the wall seems to tick just fine. Why should you care that the crystal on the board and the NTP server on the network are only approximations?

You should care because time has a nasty habit of hiding inside assumptions. When a system fails because a machine ran out of memory, the cause usually shows up in a log or a core dump. When it fails because two components disagreed by a few milliseconds about the order in which things happened, you may only see a cascade of secondary symptoms. Time related bugs are often some of the hardest to reason about and reproduce.

You should also care because systems are changing in ways that stress the old mental models. Power sensitive devices want to sleep most of the time and only wake when the world pokes them. Global scale services stretch across distances where speed of light delay matters, and they operate through layers of virtualized infrastructure where clocks can stall, drift, or jump. In those environments, pretending there is a single, trustworthy clock is less and less tenable.

Most importantly, you should care because treating time as a dependency instead of a number gives you better tools. It encourages you to ask what really needs to happen before what, to make that ordering explicit, and to build protocols that keep working even when the underlying clocks misbehave. That shift does not just make designs more elegant. It makes the resulting systems more tolerant of the messy, physical reality they live in.

The Takeaway

Async design can feel like an implementation detail if you never look below the abstractions and stay within a cozy, single process runtime. The moment you step outside, into multiple clock domains, deeply sleeping firmware, or services spread across continents, it becomes the whole computer.

Software still only exists while hardware changes. When change is no longer paced by a friendly global clock, software has to say which change should trigger the next one. That is the heart of the matter. Think and design in terms of readiness, causality, and explicit dependency, and the hardware, clocked or not, usually manages to keep up. Ignore those relationships, and sooner or later the clock you relied on will surprise you, often on a late night when the pager goes off and the graphs are flat.

References

1. S. Rajapandian, "Clock Network Design: A Survey", IEEE Design and Test of Computers, 2005. https://ieeexplore.ieee.org/document/1544345
2. Jonathan Corbet, "The 2012 Leap Second", LWN.net, July 2012. https://lwn.net/Articles/499410/
3. John Graham-Cumming, "How and Why the Leap Second Affected Cloudflare DNS", Cloudflare Blog, January 2017. https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
4. Semiconductor Engineering Knowledge Center, "Clock Domain Crossing". https://semiengineering.com/knowledge_centers/eda-design/verification/clock-domain-crossing/
6. Andrew S. Tanenbaum and David J. Wetherall, Computer Networks, 5th Edition, InformIT, 2010. https://www.informit.com/store/computer-networks-9780132126953
7. James C. Corbett et al., "Spanner: Google's Globally Distributed Database", OSDI 2012. https://www.usenix.org/system/files/conference/osdi12/osdi12-final-16.pdf
8. "Asynchronous circuit", Wikipedia. https://en.wikipedia.org/wiki/Asynchronous_circuit
9. David May, "The Transputer Architecture", Byte Magazine, August 1989. https://archive.org/details/byte-magazine-1989-08/page/n197/mode/2up
10. C. A. R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985. https://www.cs.cmu.edu/~crary/819-f09/Hoare78.pdf
11. Alain J. Martin et al., "The Design of an Asynchronous Microprocessor", Integration, the VLSI Journal, 1990. https://authors.library.caltech.edu/records/yayps-tds62
12. Steve Furber et al., "Overview of the SpiNNaker System", Proceedings of the IEEE, 2014. https://ieeexplore.ieee.org/document/6987310
13. Leslie Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM, 1978. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
14. David L. Mills, "Network Time Synchronization: The Network Time Protocol", CRC Handbook on Architectures for Digital Signal Processing, 1989, and later updates. https://www.eecis.udel.edu/~mills/ntp.html
15. Nihal Ahmad et al., "Context-Aware Management of IoT Nodes: Balancing Informational Value with Energy Usage", arXiv:2511.09111, 2025. https://arxiv.org/abs/2511.09111
