Durable events need local replay

A production event backbone is not enough by itself. If I want a system to survive real use, I also want a boring local replay path that can rebuild state from facts without touching production.

Yesterday's work surfaced a pattern worth writing down in public: when a product starts moving toward durable event storage, the next question should not only be where the production events live.

The next question should be: how do we replay them when something goes wrong?

That showed up across two different implementation tracks. One was a small scorekeeping product moving from browser-local state toward an external event backbone. The other was a sync API library getting a local durable store so sessions could be rebuilt across process restarts. The details were different. The shape was the same.

Events are only useful if they can be trusted after the first request finishes.

The trap: treating the event backbone as the whole system

It is easy to design the production path and stop there.

A client sends a command. The server validates it. The server emits a fact. A pathway or worker consumes that fact and writes a read model. The UI reads the model and moves on.
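To make that path concrete, here is a minimal sketch of it in Python. The names (RecordScore, ScoreRecorded, handle, project) are hypothetical and borrowed loosely from the scorekeeping example; an in-memory list stands in for the event backbone, and nothing here is the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordScore:
    """Command: what the client asks for (hypothetical scorekeeping example)."""
    game_id: str
    player: str
    points: int


@dataclass(frozen=True)
class ScoreRecorded:
    """Event: the fact the server emits once the command is accepted."""
    game_id: str
    player: str
    points: int
    version: int


def handle(command: RecordScore, next_version: int) -> ScoreRecorded:
    """Validate intent, then emit a canonical fact."""
    if command.points <= 0:
        raise ValueError("points must be positive")
    return ScoreRecorded(command.game_id, command.player, command.points, next_version)


def project(read_model: dict[str, int], event: ScoreRecorded) -> dict[str, int]:
    """Pathway step: fold one fact into the read model (player -> total)."""
    updated = dict(read_model)
    updated[event.player] = updated.get(event.player, 0) + event.points
    return updated


# An in-memory list stands in for the production event backbone.
log: list[ScoreRecorded] = []
scoreboard: dict[str, int] = {}
commands = [
    RecordScore("g1", "ana", 3),
    RecordScore("g1", "ben", 2),
    RecordScore("g1", "ana", 1),
]
for command in commands:
    event = handle(command, next_version=len(log) + 1)
    log.append(event)
    scoreboard = project(scoreboard, event)
```

The UI only ever reads scoreboard. The log is what makes the rest of this post possible.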

That is already better than hiding all state inside one browser tab. But it still leaves a gap: production delivery is not the same thing as operational understanding.

When a projection fails, a deploy rolls back, a worker pauses, or a data model changes, I do not want to debug by guessing. I want to take the facts, replay them locally, and compare the rebuilt state with what production believes.

That is why local replay belongs in the first serious slice, not in a future cleanup ticket.

The useful shape

[Figure: architecture diagram showing client commands flowing through an API into a production event backbone, with pathways updating read models and a separate local replay harness restoring events into DuckDB for verification.]
The production path and the replay path share the same facts. Production keeps the product alive; replay keeps the system understandable.

The pattern I want is simple:

  • Commands cross an API boundary instead of letting the UI write arbitrary facts.
  • The API validates intent and emits canonical events.
  • The production event backbone stores and routes those facts.
  • Pathways or workers build the read models used by the product.
  • A local replay harness can ingest exported facts and rebuild state into an embedded database.

The last item is the one that usually gets skipped. It should not.

Why I like DuckDB for this job

The local store does not need to pretend to be the production system. It needs to be portable, inspectable, and easy to reset.

For this kind of operational replay, DuckDB is a good fit. It gives me a single local database file, strong analytical querying, and a low-friction way to load JSON or newline-delimited event exports. I can run a restore command, inspect the resulting tables, and throw the database away when the investigation is done.
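A rough sketch of that restore-and-inspect loop, under loud assumptions: the export format and field names are invented for illustration, and Python's built-in sqlite3 stands in for DuckDB purely so the example runs with no extra dependency. With DuckDB itself I would expect the load step to shrink, since it can read JSON exports directly.

```python
import io
import json
import sqlite3

# Hypothetical newline-delimited export; in practice this would be a file on disk.
EXPORT = io.StringIO(
    '{"stream": "game-1", "version": 1, "type": "ScoreRecorded", "player": "ana", "points": 3}\n'
    '{"stream": "game-1", "version": 2, "type": "ScoreRecorded", "player": "ben", "points": 2}\n'
    '{"stream": "game-1", "version": 3, "type": "ScoreRecorded", "player": "ana", "points": 1}\n'
)


def restore(conn, lines):
    """Load exported facts into a local events table."""
    conn.execute(
        "CREATE TABLE events ("
        " stream TEXT, version INTEGER, type TEXT, payload TEXT,"
        " PRIMARY KEY (stream, version))"
    )
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (event["stream"], event["version"], event["type"], json.dumps(event)),
        )


def rebuild_scoreboard(conn):
    """Replay facts in stream order and rebuild the read model."""
    totals = {}
    rows = conn.execute("SELECT payload FROM events ORDER BY stream, version")
    for (payload,) in rows:
        event = json.loads(payload)
        if event["type"] == "ScoreRecorded":
            totals[event["player"]] = totals.get(event["player"], 0) + event["points"]
    return totals


conn = sqlite3.connect(":memory:")  # throwaway database, gone when the process exits
restore(conn, EXPORT)
scoreboard = rebuild_scoreboard(conn)
```

The in-memory connection is the "throw the database away" part: nothing to clean up, nothing shared with production.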

That matters for agent-built systems. I often need verification that is fast enough to run inside the implementation loop. A replay harness that requires provisioning cloud infrastructure, sharing credentials, or manually copying state is too heavy. A local embedded database keeps the feedback loop short.

The point is not that every production system should use DuckDB. The point is that every event-backed product should have some replay surface that is cheap enough to use routinely.

The invariants that make replay valuable

A replay path is only useful if the events are disciplined. The facts need stable names, schema versions, aggregate identifiers, timestamps, and enough correlation metadata to explain why they exist.
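One way to hold that discipline is a single envelope type that every fact must fit. The sketch below is an assumption about what such an envelope could look like; the field names and the NDJSON helpers are hypothetical, not the format either project actually uses.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str      # stable name, e.g. "score.recorded"
    schema_version: int  # bumped whenever the payload shape changes
    aggregate_id: str    # which aggregate or stream the fact belongs to
    stream_version: int  # position of the fact within that stream
    occurred_at: str     # ISO 8601 timestamp in UTC
    correlation_id: str  # ties the fact back to the originating request
    causation_id: str    # the command or event that directly caused it
    payload: dict


def to_ndjson_line(event: EventEnvelope) -> str:
    """Serialize one event as a single line of a newline-delimited export."""
    return json.dumps(asdict(event), sort_keys=True)


def from_ndjson_line(line: str) -> EventEnvelope:
    """Parse one exported line back into an envelope."""
    return EventEnvelope(**json.loads(line))


example = EventEnvelope(
    event_type="score.recorded",
    schema_version=1,
    aggregate_id="game-1",
    stream_version=1,
    occurred_at=datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
    correlation_id="req-123",
    causation_id="cmd-456",
    payload={"player": "ana", "points": 3},
)
line = to_ndjson_line(example)
```

If an event cannot survive a round trip through its own export format, it is not ready to be replayed.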

They also need idempotency rules. A replay command should be able to reject collisions, tolerate exact duplicates where appropriate, and surface gaps instead of silently smoothing them over.
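Those three rules fit in a few lines. This sketch assumes each event is a dict with "stream" and "version" keys, that versions start at 1 and increase by one, and that a "collision" means two different events claiming the same stream position. The names ingest and CollisionError are hypothetical.

```python
from collections import defaultdict


class CollisionError(Exception):
    """Two different events claim the same (stream, version) slot."""


def ingest(events):
    """Return (accepted_events, gaps) for a batch of exported events."""
    seen = {}
    for event in events:
        key = (event["stream"], event["version"])
        if key not in seen:
            seen[key] = event
        elif seen[key] != event:
            # Same slot, different content: refuse rather than pick a winner.
            raise CollisionError(f"conflicting events at {key}")
        # Otherwise it is an exact duplicate and is safe to skip.

    versions = defaultdict(set)
    for stream, version in seen:
        versions[stream].add(version)

    # Report missing versions instead of silently replaying around them.
    gaps = {}
    for stream, found in versions.items():
        missing = sorted(set(range(1, max(found) + 1)) - found)
        if missing:
            gaps[stream] = missing

    accepted = [seen[key] for key in sorted(seen)]
    return accepted, gaps


batch = [
    {"stream": "game-1", "version": 1, "points": 3},
    {"stream": "game-1", "version": 1, "points": 3},  # exact duplicate
    {"stream": "game-1", "version": 3, "points": 1},  # version 2 is missing
]
accepted, gaps = ingest(batch)
```

Whether a gap should abort the replay or only be reported is a policy choice; the sketch only makes sure the gap is visible.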

That changes the debugging posture. Instead of asking "what did the UI do?" I can ask sharper questions:

  • Did this command emit exactly one accepted fact?
  • Did projection state rebuild from the event log?
  • Are any stream versions missing?
  • Did a callback fail after the projection was already saved?
  • Can I reproduce the current read model from exported facts alone?

Those are better questions because they are answerable by code.
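As one illustration, the last question reduces to a rebuild plus a diff. The sketch assumes a simple player-to-total read model and a hypothetical production snapshot; rebuild and diff_read_models are made-up names.

```python
def rebuild(events):
    """Fold exported facts into a player -> total read model."""
    totals = {}
    for event in sorted(events, key=lambda e: e["version"]):
        totals[event["player"]] = totals.get(event["player"], 0) + event["points"]
    return totals


def diff_read_models(rebuilt, production):
    """Return {key: (rebuilt_value, production_value)} for every mismatch."""
    keys = set(rebuilt) | set(production)
    return {
        key: (rebuilt.get(key), production.get(key))
        for key in sorted(keys)
        if rebuilt.get(key) != production.get(key)
    }


exported = [
    {"version": 1, "player": "ana", "points": 3},
    {"version": 2, "player": "ben", "points": 2},
    {"version": 3, "player": "ana", "points": 1},
]
production_snapshot = {"ana": 4, "ben": 5}  # hypothetical: production has drifted

mismatches = diff_read_models(rebuild(exported), production_snapshot)
```

An empty result means the read model is reproducible from facts alone. Anything else is a concrete lead instead of a guess.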

Replay is not just disaster recovery

The obvious use case is recovery: something breaks, so replay the facts.

The more common use case is confidence.

Replay lets me test migrations against real-looking histories. It lets me verify a projection change before making it the production path. It lets me compare one storage adapter against another. It lets me keep the UI honest because the screen state can be rebuilt from the source of truth instead of trusted because it looked right once.
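A small example of the migration case, assuming a hypothetical schema change where version 1 events called the field "score" and version 2 renamed it to "points". Replaying a mixed history through an upcaster shows whether the new projection copes with old facts.

```python
def upcast(event):
    """Bring an old event up to the current schema (hypothetical v1 -> v2 rename)."""
    if event["schema_version"] == 1:
        return {
            "schema_version": 2,
            "player": event["player"],
            "points": event["score"],  # v1 called this field "score"
        }
    return event


def project_v2(events):
    """New projection: only understands the v2 shape, so upcast first."""
    totals = {}
    for event in map(upcast, events):
        totals[event["player"]] = totals.get(event["player"], 0) + event["points"]
    return totals


# A mixed history, as a long-lived stream might contain.
history = [
    {"schema_version": 1, "player": "ana", "score": 3},
    {"schema_version": 2, "player": "ben", "points": 2},
    {"schema_version": 1, "player": "ana", "score": 1},
]
totals = project_v2(history)
```

The same harness works for comparing storage adapters: feed both the same history and diff the results.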

That is especially useful when a product is still small. Early replay support feels like extra ceremony, but it keeps the project from growing around hidden state. Once hidden state becomes product behavior, it is much harder to unwind.

The rule I am taking forward

For small products that are becoming durable systems, I want the first production event slice to include both paths:

  1. The normal production path: command, event, pathway, read model.
  2. The operational path: export, restore, replay, inspect.

If both exist, the system is not just online. It is explainable.

That is the difference I care about. A product can ship with a thin UI and still be on the right track if the facts are durable and replayable. A polished UI over opaque state is harder to trust.

Build the event backbone. Then give yourself a local way to prove what happened.
