A multi-agent framework is a set of runtime contracts

Agents are not the hard part by themselves. The hard part is making the boundaries between agents, tools, handoffs, guardrails, streams, and traces explicit enough that the system can be tested.

Yesterday's planning work produced a TypeScript implementation outline for a multi-agent orchestration framework. The backlog itself is private working material, so it is not the lesson worth publishing. The reusable lesson is the shape of the runtime boundary.

When people talk about multi-agent systems, the conversation often jumps straight to personalities: a triage agent, a specialist agent, a reviewer agent, a planner agent. That can be useful language for product work, but it is not enough for implementation. A framework needs contracts, not vibes.

The contract I want is simple: every moving part in the orchestration loop should have a typed boundary and a testable failure mode.

The primitive set matters

A useful multi-agent framework has a small number of primitives. The names can vary, but the responsibilities should stay separate:

  • Agent: the configured unit of work, with instructions, model settings, tools, handoff options, and guardrails.
  • Thread: the accumulated conversation and run history that can be formatted for a model without losing structure.
  • Tool: a callable capability with a schema, validation, execution, and an error surface.
  • Runner: the loop that decides whether to call the model again, execute a tool, transfer control, or finish.
  • Handoff: an explicit transfer of control to another agent, with optional context filtering.
  • Guardrail: an input or output check that can stop the run before bad state becomes accepted state.
  • Stream: a real-time event view over the same run instead of a separate execution path.
  • Trace: the observability layer that explains model calls, tool calls, handoffs, and failures after the fact.

Keeping those pieces separate prevents the runner from becoming a bag of special cases. It also makes the framework easier to test without a live model call for every behavior.
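
As a rough illustration of that separation, here is a minimal sketch of how a few of the primitives could be typed. Every name and field below (Message, ToolResult, targetAgent, the "example-model" placeholder, and so on) is an assumption made for this example, not the API of any particular framework. The runner, stream, and trace are left out because they are behavior rather than data.

```typescript
// Hypothetical shapes; every name here is an illustrative assumption.
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; toolName: string; content: string };

// Thread: structured history, not one flattened prompt string.
interface Thread {
  messages: Message[];
}

type ToolResult =
  | { ok: true; output: string }
  | { ok: false; error: string };

interface Tool {
  name: string;
  description: string;
  // Returns a normalized result instead of throwing at the runner.
  invoke(rawArgs: unknown): ToolResult;
}

interface Handoff {
  targetAgent: string;
  // Optional filter over the context carried to the target agent.
  filter?: (thread: Thread) => Thread;
}

interface GuardrailResult {
  tripped: boolean;
  reason?: string;
}

interface Guardrail {
  name: string;
  stage: "input" | "output";
  check(text: string): GuardrailResult;
}

interface Agent {
  name: string;
  instructions: string;
  model: string;
  tools: Tool[];
  handoffs: Handoff[];
  guardrails: Guardrail[];
}

// Agents are plain data, so a test can build one without any model call.
const triage: Agent = {
  name: "triage",
  instructions: "Route the request to the right specialist.",
  model: "example-model", // placeholder, not a real model identifier
  tools: [],
  handoffs: [{ targetAgent: "billing" }],
  guardrails: [],
};
```

Because an agent here is only configuration, the loop that consumes it can be tested separately with whatever agents a test needs.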

The runner is the product surface

[Figure: a runner coordinating a thread, active agent, model call, tools, handoffs, guardrails, streaming events, and tracing spans through explicit runtime contracts.]
Caption: The runner should coordinate explicit contracts instead of hiding orchestration logic inside one large prompt.

The runner is where the primitives become a system. It builds messages from the thread, calls the model, interprets the response, executes tools, records results, switches agents when a handoff is requested, and stops when a final answer is produced or a safety boundary trips.

If that loop is vague, the whole framework becomes vague. Tool errors are hard to represent. Handoff cycles are hard to detect. Streaming becomes a second implementation instead of an event view over the same run. Traces become console noise rather than a useful explanation of what happened.

So I like to design the runner as a state machine with boring exits: completed, max turns exceeded, guardrail tripped, tool failed, handoff accepted, or run errored. Boring exits are good. They make agent systems debuggable.
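
A minimal sketch of that state machine, under some loud assumptions: the loop is synchronous for brevity (a real runner would be async), the model is a plain function so it can be mocked, guardrail exits are omitted, and an accepted handoff is modeled as a transition that keeps the loop running rather than as a terminal exit. All type and function names are hypothetical.

```typescript
// Hypothetical types for illustration only.
type ModelStep =
  | { kind: "final"; text: string }
  | { kind: "tool_call"; tool: string; args: string }
  | { kind: "handoff"; target: string };

type RunExit =
  | { status: "completed"; output: string; agent: string }
  | { status: "max_turns_exceeded"; turns: number }
  | { status: "tool_failed"; tool: string; error: string }
  | { status: "run_errored"; error: string };

type ModelFn = (agent: string, history: string[]) => ModelStep;
type ToolFn = (args: string) => string;

function run(
  startAgent: string,
  input: string,
  model: ModelFn,
  tools: Record<string, ToolFn>,
  maxTurns: number
): RunExit {
  let agent = startAgent;
  const history: string[] = ["user: " + input];
  for (let turn = 0; turn < maxTurns; turn++) {
    let step: ModelStep;
    try {
      step = model(agent, history);
    } catch (e) {
      return { status: "run_errored", error: String(e) };
    }
    switch (step.kind) {
      case "final":
        return { status: "completed", output: step.text, agent: agent };
      case "handoff":
        // Transition, not exit: the next turn runs as the target agent.
        agent = step.target;
        history.push("handoff: " + agent);
        break;
      case "tool_call": {
        const toolName = step.tool;
        if (!Object.prototype.hasOwnProperty.call(tools, toolName)) {
          return { status: "tool_failed", tool: toolName, error: "unknown tool" };
        }
        try {
          history.push("tool " + toolName + ": " + tools[toolName](step.args));
        } catch (e) {
          return { status: "tool_failed", tool: toolName, error: String(e) };
        }
        break;
      }
    }
  }
  return { status: "max_turns_exceeded", turns: maxTurns };
}

// A scripted mock stands in for the model, so the loop is testable offline.
const script: ModelStep[] = [
  { kind: "tool_call", tool: "echo", args: "hi" },
  { kind: "handoff", target: "billing" },
  { kind: "final", text: "done" },
];
let cursor = 0;
const exit = run("triage", "hello", () => script[cursor++], { echo: (a) => a }, 5);
// exit is { status: "completed", output: "done", agent: "billing" }
```

Because every exit is a member of one union, a caller has to handle each outcome explicitly, and a handoff loop that never finishes ends in max_turns_exceeded instead of hanging.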

Tools and handoffs need the same discipline

A tool should not be just a function in a prompt. It should have a schema, a validation step, an execution wrapper, and a normalized result. That does two things: it protects the tool from malformed input, and it gives the runner one shape to feed back into the model.
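
One way to sketch those four parts, assuming a hand-written parse function as a stand-in for a real schema library. ToolDef, callTool, errorMessage, and the divide tool are all hypothetical names invented for this example.

```typescript
// Normalized result: the one shape the runner feeds back to the model.
type ToolResult =
  | { ok: true; output: string }
  | { ok: false; error: string };

interface ToolDef<A> {
  name: string;
  // Stand-in for a schema: returns typed args or throws with a reason.
  parse(raw: unknown): A;
  execute(args: A): string;
}

function errorMessage(e: unknown): string {
  return e instanceof Error ? e.message : String(e);
}

// Execution wrapper: validation and execution failures both become data.
function callTool<A>(tool: ToolDef<A>, raw: unknown): ToolResult {
  let args: A;
  try {
    args = tool.parse(raw);
  } catch (e) {
    return { ok: false, error: "invalid arguments for " + tool.name + ": " + errorMessage(e) };
  }
  try {
    return { ok: true, output: tool.execute(args) };
  } catch (e) {
    return { ok: false, error: tool.name + " failed: " + errorMessage(e) };
  }
}

const divide: ToolDef<{ a: number; b: number }> = {
  name: "divide",
  parse(raw) {
    const r = raw as { a?: unknown; b?: unknown } | null;
    if (!r || typeof r.a !== "number" || typeof r.b !== "number") {
      throw new Error("expected { a: number, b: number }");
    }
    return { a: r.a, b: r.b };
  },
  execute(args) {
    if (args.b === 0) throw new Error("division by zero");
    return String(args.a / args.b);
  },
};

const quotient = callTool(divide, { a: 6, b: 3 });
// quotient is { ok: true, output: "2" }
const malformed = callTool(divide, { a: 1 });
// malformed.ok is false; execute was never called
```

The wrapper never throws, so the runner can decide in one place whether a failed tool call ends the run or gets reported back to the model.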

Handoffs deserve the same treatment. A handoff is not a paragraph saying another agent exists. It is a capability exposed to the active agent, with a target, an optional filter over the carried context, and a lifecycle hook for the runtime. That makes it possible to test handoff detection, agent switching, context stripping, and cycle prevention without hoping the prompt behaves.
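
A small sketch of a handoff treated as a typed capability. The visit-count rule used for cycle prevention is one possible policy chosen for this example, and Msg, Handoff, applyHandoff, and stripToolMessages are hypothetical names.

```typescript
interface Msg {
  role: "user" | "assistant" | "tool";
  content: string;
}

interface Handoff {
  target: string;
  // Optional filter over the context carried to the target agent.
  filter?: (messages: Msg[]) => Msg[];
  // Optional lifecycle hook the runtime calls when the handoff is accepted.
  onHandoff?: (from: string, to: string) => void;
}

type HandoffOutcome =
  | { accepted: true; agent: string; messages: Msg[]; path: string[] }
  | { accepted: false; reason: string };

// path lists the agents visited so far, starting agent included.
function applyHandoff(
  from: string,
  handoff: Handoff,
  messages: Msg[],
  path: string[],
  maxVisits: number
): HandoffOutcome {
  const visits = path.filter((agentName) => agentName === handoff.target).length;
  if (visits >= maxVisits) {
    return { accepted: false, reason: "cycle: " + path.concat(handoff.target).join(" -> ") };
  }
  const carried = handoff.filter ? handoff.filter(messages) : messages;
  if (handoff.onHandoff) handoff.onHandoff(from, handoff.target);
  return { accepted: true, agent: handoff.target, messages: carried, path: path.concat(handoff.target) };
}

// Context stripping: the target agent does not see raw tool output.
const stripToolMessages = (messages: Msg[]): Msg[] =>
  messages.filter((m) => m.role !== "tool");

const thread: Msg[] = [
  { role: "user", content: "refund please" },
  { role: "tool", content: "lookup: order 42" },
];
const first = applyHandoff("triage", { target: "billing", filter: stripToolMessages }, thread, ["triage"], 1);
// first.accepted is true and only the user message is carried over
const back = applyHandoff("billing", { target: "triage" }, thread, ["triage", "billing"], 1);
// back.accepted is false: triage has already been visited once
```

Nothing in this sketch involves a prompt, which is the point: switching, stripping, and cycle checks are ordinary functions with ordinary tests.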

The same rule applies to guardrails. If a guardrail can stop a run, it should return a structured result and the runner should have a first-class tripwire path. Safety logic hidden in prose is too easy to bypass accidentally.
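
The tripwire path can be sketched in the same style. This assumes synchronous checks and a stubbed produce function in place of the model call; the names and the 16-digit rule are illustrative only.

```typescript
type GuardrailResult =
  | { tripped: false }
  | { tripped: true; guardrail: string; reason: string };

interface Guardrail {
  name: string;
  check(text: string): GuardrailResult;
}

// Runs checks in order and returns the first tripwire, if any.
function runGuardrails(guardrails: Guardrail[], text: string): GuardrailResult {
  for (let i = 0; i < guardrails.length; i++) {
    const result = guardrails[i].check(text);
    if (result.tripped) return result;
  }
  return { tripped: false };
}

type GuardedExit =
  | { status: "completed"; output: string }
  | { status: "guardrail_tripped"; stage: "input" | "output"; guardrail: string; reason: string };

// produce stands in for the model call; it never runs if input trips a guardrail.
function guardedRun(
  input: string,
  produce: (input: string) => string,
  inputRails: Guardrail[],
  outputRails: Guardrail[]
): GuardedExit {
  const pre = runGuardrails(inputRails, input);
  if (pre.tripped) {
    return { status: "guardrail_tripped", stage: "input", guardrail: pre.guardrail, reason: pre.reason };
  }
  const output = produce(input);
  const post = runGuardrails(outputRails, output);
  if (post.tripped) {
    return { status: "guardrail_tripped", stage: "output", guardrail: post.guardrail, reason: post.reason };
  }
  return { status: "completed", output: output };
}

// Toy rule for the example: reject any run of 16 digits.
const noCardNumbers: Guardrail = {
  name: "no-card-numbers",
  check(text) {
    return /\d{16}/.test(text)
      ? { tripped: true, guardrail: "no-card-numbers", reason: "16-digit number found" }
      : { tripped: false };
  },
};
```

The tripped result is just another exit, so output that fails a check never becomes accepted state, and a test can assert on the stage and reason rather than on prompt wording.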

Streaming and tracing should not fork the architecture

Streaming often gets added late, after a blocking runner already works. That is where a framework can start to split. The blocking path returns one result. The streaming path has its own event handling. Bugs get fixed in one path and missed in the other.

The cleaner design is to make streaming an event surface over the same execution semantics. The stream can emit agent start, tool call start, tool call end, handoff, message delta, run complete, and run error events. A collector can turn those events back into the same final result type used by the blocking runner.
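
Here is a minimal sketch of that idea, assuming a callback-based emitter and a list of text chunks standing in for model output. RunEvent, RunResult, collect, runStreamed, and runBlocking are hypothetical names; the event types follow the list above.

```typescript
type RunEvent =
  | { type: "agent_start"; agent: string }
  | { type: "tool_call_start"; tool: string }
  | { type: "tool_call_end"; tool: string; output: string }
  | { type: "handoff"; from: string; to: string }
  | { type: "message_delta"; text: string }
  | { type: "run_complete" }
  | { type: "run_error"; error: string };

interface RunResult {
  status: "completed" | "errored" | "incomplete";
  agent: string;
  output: string;
  error?: string;
}

// Collector: folds an event sequence back into the final result type.
function collect(events: RunEvent[]): RunResult {
  const result: RunResult = { status: "incomplete", agent: "", output: "" };
  for (let i = 0; i < events.length; i++) {
    const ev = events[i];
    switch (ev.type) {
      case "agent_start":
        result.agent = ev.agent;
        break;
      case "handoff":
        result.agent = ev.to;
        break;
      case "message_delta":
        result.output += ev.text;
        break;
      case "run_complete":
        result.status = "completed";
        break;
      case "run_error":
        result.status = "errored";
        result.error = ev.error;
        break;
      default:
        // Tool events do not change the final result in this sketch.
        break;
    }
  }
  return result;
}

// The single execution path. It only emits events; chunks stand in for the model.
function runStreamed(agent: string, chunks: string[], emit: (ev: RunEvent) => void): void {
  emit({ type: "agent_start", agent: agent });
  for (let i = 0; i < chunks.length; i++) {
    emit({ type: "message_delta", text: chunks[i] });
  }
  emit({ type: "run_complete" });
}

// The blocking runner is a thin wrapper: same execution, events buffered.
function runBlocking(agent: string, chunks: string[]): RunResult {
  const buffered: RunEvent[] = [];
  runStreamed(agent, chunks, (ev) => { buffered.push(ev); });
  return collect(buffered);
}

const blockingResult = runBlocking("triage", ["Hel", "lo"]);
// blockingResult is { status: "completed", agent: "triage", output: "Hello" }
```

With this shape there is no second implementation to drift: a bug fixed in runStreamed is fixed for both the streaming and the blocking caller.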

Tracing has a similar job. It should not be scattered print statements. It should be a pluggable provider with spans around the run, model calls, tool execution, and handoffs. In local development that provider can write to the console. In production it can connect to a real observability backend. The framework code should not care.
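
A pluggable provider can be as small as the sketch below. The interface is an assumption made for this example, not the API of any tracing library, and an in-memory provider stands in for the console or production backend so the spans can be asserted on in tests.

```typescript
type SpanKind = "run" | "model_call" | "tool" | "handoff";

interface Span {
  end(error?: string): void;
}

interface TraceProvider {
  startSpan(kind: SpanKind, name: string): Span;
}

// Default provider: tracing is off and framework code is unchanged.
const noopProvider: TraceProvider = {
  startSpan: () => ({ end: () => {} }),
};

interface SpanRecord {
  kind: SpanKind;
  name: string;
  ended: boolean;
  error?: string;
}

// In-memory provider for tests. A console provider would log in the same two places.
function memoryProvider(records: SpanRecord[]): TraceProvider {
  return {
    startSpan(kind, name) {
      const record: SpanRecord = { kind: kind, name: name, ended: false };
      records.push(record);
      return {
        end(error) {
          record.ended = true;
          if (error !== undefined) record.error = error;
        },
      };
    },
  };
}

// Wraps a unit of work in a span, records failures, and rethrows them.
function traced<T>(provider: TraceProvider, kind: SpanKind, name: string, fn: () => T): T {
  const span = provider.startSpan(kind, name);
  try {
    const value = fn();
    span.end();
    return value;
  } catch (e) {
    span.end(e instanceof Error ? e.message : String(e));
    throw e;
  }
}

const records: SpanRecord[] = [];
const tracer = memoryProvider(records);
const answer = traced(tracer, "run", "demo", () =>
  traced(tracer, "tool", "echo", () => "ok")
);
// answer is "ok"; records holds a "run" span and a "tool" span, both ended
```

The runner only ever sees TraceProvider, so swapping the in-memory provider for a console or production one is a configuration change rather than a code change.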

The implementation order I trust

Do not start with the clever orchestration demo. Start with the contracts:

  1. Scaffold the TypeScript package, tests, linting, and ESM build first.
  2. Define the core interfaces before implementing behavior.
  3. Build the tool system with validation and error handling.
  4. Add agents and threads as small wrappers around typed state.
  5. Implement the basic runner loop against mocked model responses.
  6. Add handoffs, guardrails, streaming, and tracing as separate capabilities with tests.
  7. Only then write examples that look like the product story.

That order keeps the framework honest. The examples are allowed to be pleasant, but the core should already be testable before the demo becomes convincing.

The rule I am taking forward

Multi-agent orchestration is not mainly about inventing more agent roles. It is about making the runtime contracts clear enough that roles can come and go without breaking the loop.

Prompt the agents; type the boundaries.

That is the design pressure I want in agent frameworks. Let prompts describe intent. Let code own tools, handoffs, guardrails, streaming, tracing, and failure modes. When those boundaries are explicit, the system is easier to ship, easier to observe, and much easier to trust.
