THE LEDGER · ENTRY III · FRAMEWORKS

THE HARNESS
IS THE PRODUCT

Why I built a multi-agent workflow engine instead of better prompts — role-based model routing, isolated agents, and gates a model can't fake.

2026 · VI · 22  —  11 MIN  —  CHECKED COLD BEFORE ENTRY

I didn't set out to build a multi-agent framework. I set out to stop being the integration layer between my own Claude sessions.

The pattern was always the same. One session drafted a spec. I pasted it into another, which argued architecture. I pasted the plan into a third, and code appeared. Then I diffed it by hand, found three contradictions the model had forgotten since page one, and started over.

The models were not the problem. The harness was.

More precisely, the harness was me: tired, holding the whole context in my head, deciding which answer counted, remembering what the last reviewer said, re-checking whether "fixed" meant fixed. Nothing on disk was authoritative. Nothing outside my own attention enforced the workflow.

So I built the harness.

The Bureau

The system is called the Bureau. It lives as one global install, and it is not a wrapper around a chat window. It's a workflow engine: a task registry, isolated run directories, specialist personas, model routing by role, human checkpoints, and gates that something other than a model has to clear.

The Conductor, the central session, writes no specs and no code. It triages the task against the registry (bug fix, feature, execute-plan, operational-build) and spawns specialists into fresh contexts.

That isolation is the whole point. A cold reviewer is only cold if it never watched the plan get argued into shape.

The cast is functional, even where the names are theatrical:

Analizer 2000 writes requirements and surfaces hidden assumptions.
The Architect designs the system and maps how each piece depends on the others.
The Challenger runs cold, adversarial review.
The Spellwright turns approved plans into scoped build prompts.
The build party (The Mage, The Systemsmith, The Mechanic) handles interface, backend, and infrastructure.

Yes, those are the names. If I'm going to live inside a system for months, it should have enough personality that I want to come back to it.

Every run gets its own directory:

output/runs/<date>-<slug>/

Machine state lives in state.json. Human history lives in log.md. Specs, plans, briefs, prompts, reviews, build outputs are all ordinary files. The conversation is never the source of truth. The artifacts are.

Routing is by role, not by whichever chatbot I like that week. The framework defines provider-neutral tiers (cheap, standard, strong, frontier) and a resolver maps them to whatever host is running the work. Scoot and Tally are the shop droids: Scoot answers "does this path exist?", Tally maps every place a symbol appears. Neither needs a frontier model. The Conductor runs strong; the Challenger runs strong in a fresh context; ad-hoc spawns never inherit a premium session, because that is how you burn tokens grepping a log file.

Code-changing runs get a git worktree each. Branch, edit in isolation, merge at close-out. Two sessions can work the same repository without stepping on each other.

I've shipped real work through this: a memoir-writing app built from 26 scoped prompts, now in active dogfooding. A ticket tracker, live in production. A read-only dashboard that scans state.json across installs. A Telegram assistant exposing ticket and memory tools over MCP.

Four production systems, one person. The harness is the reason. Not because it writes better code than I do, but because it holds the context I can't, and won't call a thing done until an artifact on disk says so.

Where the system lied to me

That's the harness working. But I trust it most in the opposite case: when something inside it is wrong and the system catches the lie anyway.

Every trusted green signal in this system has lied to me at least once: my own diagnosis, the cold reviewer, the passing fixture. Each time, the thing that caught the lie was the same: something repo-aware, reading the actual code, outside the party that wanted to ship. Here are four.

The run where green meant wrong

I was designing a Delegate, a narrow agent to hold routine checkpoint gates so runs could continue while I was asleep, or parenting, or doing any of the parts of life that aren't pasting agent output between windows.

The Delegate was deliberately limited. Flow and gating only. Not a preference model, not "what would the Visionary want?" Just: does this artifact meet the written spec, checklist, and acceptance contract?

Before building it, I stress-tested the design. I ran the document past five general-purpose models. The feedback was useful. One caught fused responsibilities. One sharpened the audit trail. One pushed me to separate Delegate and Notary concerns. By the fifth review the returns had collapsed into agreement, and it felt earned.

Then I ran a repo-aware reviewer in the actual repository, with the relay telemetry in front of it.

My central diagnosis was wrong.

I'd assumed the cost problem was the growing Conductor transcript, checkpoint after checkpoint dragging more history through the model. Clean diagnosis, clean fix: move checkpoint review into a short-lived file-mailbox process. Read artifacts from disk, write a verdict, exit.

The reviewer looked at the numbers. The run had 52M input tokens. 50.2M were cached, roughly 96%. The sink was never the transcript. It was 126 terminal round-trips, redraw noise, the same artifacts re-verified over and over, and one resumed session that spun for eight minutes producing nothing.

It also found two design claims I'd been treating as facts. I'd said the Delegate persona would clear the framework's battle-test gate; it checked the rules: that gate applies to workflows, not personas. I'd said the Delegate could intercept an existing review step as a checkpoint; it traced the workflows: that step runs straight through, and intercepting it means adding gates to the workflows, not making a small local substitution.

Five reviewers with only my design document converged on a confident wrong story. One reviewer with the running system found what was actually broken.

That became a routing rule:

Route to ground truth, not to a vote.

(I dig into that consensus failure on its own terms in Who Checks the Checker?)

The role still earned its place

The cost diagnosis was wrong. The role was not.

I had a hard run sitting in the output directory: a 109 KB spec, a 94 KB prompt set, a shell-and-jq feature with an ugly validation contract. The Challenger had passed the prompts. A focused verification pass reported zero blockers.

The Delegate prototype kept going, and caught twenty build-breaking defects across three rounds.

Not style notes. Build breakers. A function called before it was defined. A script aborting on a missing file. A boolean handled as a string. An atomic write that wasn't atomic across filesystems. A schema key drifted from the contract.

The Challenger had passed every one.

That settled something I'd been unsure about. A second reviewer earns its place when it checks how the first reviewer's findings were handled, not "does this read well cold?" but:

Does the claimed fix have evidence in the changed text?

Nobody else in the pipeline was asking that.

It also narrowed the cost story honestly. You don't make a 94 KB prompt review cheap by pretending the bug lives on page one. The real win is smaller and more mechanical: kill the live-terminal overhead, kill cross-round accumulation, run the review in a fresh process: read files, write verdict, exit. Not magic. Just less machinery left running.

The expensive loops were upstream

Once I stopped blaming the wrong thing, another pattern surfaced.

The expensive loops weren't mainly The Challenger being picky. They were Analizer 2000 and The Architect drifting apart. Analizer writes the requirements. The Architect designs against them (sometimes reading them differently, sometimes barred from editing them), so it bolts on an addendum. Now the second half quietly contradicts the first.

The Challenger catches it, but too late. Full loop. Route back, rewrite, reconcile, review again.

The fix is small: one cheap pass where the analyst re-reads the architecture before the expensive review. The analyst owns the requirements, so the analyst fixes the drift in place, and the contradiction never reaches The Challenger.

The Delegate makes review loops cheaper. The analyst pass makes them fewer. Different jobs. Fusing them would make both worse.

CI is the gate, not another voice

I'm building toward mechanical enforcement, not a bigger panel of advisory reviewers.

The execute-plan workflow already supports regression fixtures inside the run directory, shell commands with expected output. Before every build-prompt dispatch, the fixtures run again. A failure blocks the prompt. Not "the Conductor thinks we're fine." The command passes or it doesn't.

Last week I learned that fixtures can false-pass in ways that look exactly like green CI. One static-grep guard matched its token inside a comment line: delete the actual guarantee from the code and the fixture still passed. The fix was mutation testing: delete the guaranteed line from a copy and prove the fixture fails. The same run turned up a portability bug: one platform's grep mishandled a literal $ mid-pattern unless the fixture used grep -F.

Two fixtures, same class of problem. A green check is only useful if you also test whether it can go red.

That's the shape of gate I want everywhere. Not "the model said proceed," not "five models agreed." A script. A schema check. A cold attestation on a sealed artifact packet. Something outside the party that wants to ship. The Notary role is exactly that: external cold attestation on a sealed packet. The Delegate holds the routine checkpoints; the Notary samples whether the Delegate's "proceed" calls were actually justified against the artifacts.

A gate is only real if something outside the gated party can check it.

What I haven't shipped yet

Honest inventory:

The Delegate is designed and benchmarked on real build-breakers. It is not in production. The open question is whether the file-mailbox version actually beats the live relay on cost across a full feature run.
The analyst-drift pass is identified, not yet wired into every workflow.
CI-as-merge-gate exists at the run-fixture level, but the loop isn't closed so a repo-level merge is blocked on fixture-green the way unsupervised overnight runs demand.
The Principal is explicitly deferred. That's the agent that will eventually act on my behalf according to my decision principles. I'm not pretending it exists. Flow-and-gating comes first.

That order matters. It's tempting to jump straight to "an agent that knows what I'd do." I don't think that's the right first abstraction. I'd rather build a system that can prove an artifact met its written contract before I build one that claims to know my taste.

The harness decides

The models are instruments, not competing chatbots. Different hosts for different work: orchestration and build, repo-aware review with receipts, gateway-routed models where the product needs them. Different costs, different context behavior, different failure modes.

The harness is what decides:

which agent runs
which model tier it gets
which files it may read
which artifact it must produce
which gate checks the result
what happens when the gate fails

That is the product. Not the prompt. Not the model. Not the chat transcript. The harness.

The Bureau still breaks in predictable ways: cold reviewers rubber-stamp, fixtures false-pass, paper designs survive five model reviews until something repo-aware reads the code. Each failure goes into a cross-run lessons log, and when the same shape shows up twice it gets promoted into the framework. That's the loop I trust: build the harness, run real work through it, find where it lied, and fix the part that let it. Then leave it running overnight. The model was never the product.

END OF DISPATCH · ENTERED INTO THE RECORD