GATE THE FLOW, NOT THE JUDGMENT
You can pull the human out of a multi-agent pipeline's per-checkpoint loop with an automated gatekeeper that emits proceed, revise, or escalate. The one thing it must never do is guess what you would have wanted.
When you automate a review gate in a multi-agent pipeline, the first instinct is to make it smarter. Let it watch what the operator approves, learn the patterns, and start filling in the call on its own. Give it enough history and it will know what you would have said.
That is the wrong move. The sharpest decision in this design is the one the gatekeeper refuses to make. It emits exactly three outcomes (proceed, revise, or escalate), and it is forbidden from modeling what the operator would prefer. The gate governs flow. Preference is someone else's job, and that someone does not exist in the system yet. Automate the gate, not the judgment.
The human in every checkpoint
Picture the pipeline. An analyst frames the problem, an architect turns it into a plan, a critic tears holes in the plan, a prompt-engineer scopes the work, and a fleet of coders does it. Between every stage there is a handoff, and at every handoff a human looks at what came out and says: proceed, send it back, or stop and bring this to me. That human is the correct starting point. They carry the context, know what the work is for, and catch the things no checklist anticipated.
They also do not scale. The operator becomes the bottleneck, the pipeline stalls the moment they step away, and a system that needs a person at every gate runs at the speed of one person's attention.
So you reach for the obvious fix: replace the human at the gate, and make the replacement good by teaching it what the human would have approved. This is where it goes wrong. An approver that models what the operator would prefer optimizes for predicted approval, not for whether the work is correct, and those two targets come apart fast. It does not reduce failures. It learns which failures the operator tends to wave through, and hides them in exactly the spots the operator stopped looking. Call it a sycophancy engine at the checkpoint layer. The rest of this design is built to make that engine impossible to build by accident.
The verdict primitive
Start by replacing "looks good" with something a machine can produce honestly: a structured verdict, bound to the exact thing it judged.
A verdict has a fixed shape:
Decision: proceed | revise | escalate
Artifact-hash: SHA of the exact artifact reviewed
Uncertainties: what the reviewer was unsure about
Rationale: 1-2 sentences
Required-changes: tagged by root cause (requirements | architecture | prompts | none)
Escalation: one-line reason, or none
Ledger-pointer: where this verdict is recorded
The load-bearing detail is the hash. Every verdict is bound to a specific triple: which checkpoint, which attempt, and the hash of the artifact under review. Change the artifact and the prior verdict is dead, because the binding no longer matches. No stale "proceed" carries forward onto a file that has since been edited. That whole class of mistake, approving version one and shipping version two, is not one you can make here. The mechanics prevent it.
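The binding can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the Verdict class, the field names, and the verdict_applies helper are hypothetical, and SHA-256 is an assumption, since the verdict shape above only says "SHA".

```python
import hashlib
from dataclasses import dataclass


def artifact_hash(data: bytes) -> str:
    """Hash the exact bytes under review (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Verdict:
    """Hypothetical, trimmed-down verdict record."""
    decision: str        # "proceed" | "revise" | "escalate"
    checkpoint_id: str
    attempt: int
    artifact_hash: str
    rationale: str = ""


def verdict_applies(verdict: Verdict, checkpoint_id: str, attempt: int,
                    artifact: bytes) -> bool:
    """A verdict is live only if the whole triple still matches."""
    bound = (verdict.checkpoint_id, verdict.attempt, verdict.artifact_hash)
    current = (checkpoint_id, attempt, artifact_hash(artifact))
    return bound == current
```

Edit one byte of the artifact and verdict_applies returns False, so the earlier "proceed" cannot be reused for the new version.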
The verdict also forces a sharper question than "is this good." At a coding checkpoint the primary check is concrete: did the orchestrator close every blocker the critic raised with a real fix, or quietly wave some through? Around that sit a handful of named failure patterns: a spec that leans on something only ever said in conversation and never written into an artifact; acceptance criteria that do not map one-to-one onto the plan's phases; a phase with no test coverage; model routing that does not match the weight of the task; scope that has crept past what was asked. None of these require taste. They require reading the artifact against a checklist.
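Two of those patterns are mechanical enough to sketch as code. The function below is a hypothetical illustration: its name, its inputs (a list of phase names, a criterion-to-phase mapping, and a phase-to-tests mapping), and the wording of its findings are all assumptions, not the gatekeeper's real checklist.

```python
from collections import Counter


def checklist_findings(phases: list[str],
                       criterion_phase: dict[str, str],
                       tests_by_phase: dict[str, list[str]]) -> list[str]:
    """Return findings for two named failure patterns; empty list means clean."""
    findings = []
    counts = Counter(criterion_phase.values())

    # Every acceptance criterion must point at a phase that exists in the plan.
    for criterion, phase in criterion_phase.items():
        if phase not in phases:
            findings.append(
                f"criterion {criterion!r} maps to unknown phase {phase!r}")

    for phase in phases:
        # One-to-one: each phase carries exactly one acceptance criterion.
        if counts[phase] != 1:
            findings.append(
                f"phase {phase!r} has {counts[phase]} acceptance criteria, expected 1")
        # A phase with no tests listed has no coverage.
        if not tests_by_phase.get(phase):
            findings.append(f"phase {phase!r} has no test coverage")

    return findings
```

Nothing in the function consults what the operator approved last time. It reads the artifact's own structure and reports what is missing.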
Gate versus judgment, the hard boundary
Here is the core claim. The gatekeeper's entire scope is "can this proceed?" That question is kept structurally separate from "would the operator want this?", and the separation is enforced by what the role is not allowed to read and not allowed to decide.
"Flow-and-gating role only" is not a line in a policy doc. It is a structural absence. There is no preference model anywhere in the system, and no placeholder waiting for one. Consider what a preference model would need: a record of past operator decisions, the patterns in what got approved and what got bounced, signals about the operator's taste. None of that is in the gatekeeper's context. It cannot model preferences because it does not have the inputs. And it does not have the inputs because nobody staged them in.
That last part is the mechanism, not a prompt convention. The gatekeeper is spawned headless under a system prompt that forbids it from loading the main framework context. It reads only a staged per-checkpoint directory: the artifact under review, a bounded slice of the log, the state file, the conventions doc, and its own persona file. The full run history is physically outside its read scope, because nobody copied it into that directory. A model that starts to drift cannot re-read the whole story to reconstruct what the operator has been doing, even if it tries. The wall is made of what is on disk, not what the prompt asks for.
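A rough sketch of the staging step, under stated assumptions: the function name, the parameter names, the log_slice.txt file name, and the 200-line default for the log slice are all hypothetical. The point it illustrates is that the read scope is defined by what gets copied, nothing else.

```python
import shutil
from pathlib import Path


def stage_checkpoint(staging_dir: Path, artifact: Path, log: Path,
                     state: Path, conventions: Path, persona: Path,
                     log_tail_lines: int = 200) -> Path:
    """Copy only the allowed inputs into a fresh per-checkpoint directory."""
    if log_tail_lines <= 0:
        raise ValueError("log_tail_lines must be positive")

    # Fail loudly if the directory already exists, so nothing stale leaks in.
    staging_dir.mkdir(parents=True, exist_ok=False)

    # Whole-file copies: artifact, state file, conventions doc, persona file.
    for src in (artifact, state, conventions, persona):
        shutil.copy2(src, staging_dir / src.name)

    # The log is never copied whole; only a bounded tail is staged.
    tail = log.read_text(encoding="utf-8").splitlines()[-log_tail_lines:]
    (staging_dir / "log_slice.txt").write_text(
        "\n".join(tail) + "\n", encoding="utf-8")

    return staging_dir
```

The full run history never appears in the argument list, so there is no code path by which it reaches the staged directory.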
The failure mode this boundary prevents is a verdict that tracks predicted approval instead of the artifact. The point is not ethics. It is whether the pipeline's verdicts mean anything. A "proceed" that means "I think you'd have said yes" is worth nothing. A "proceed" that means "this passed a fixed checklist against a hashed artifact" is worth something you can build on. Any checklist change that smuggles in preference signals is, by definition, a boundary violation, and the definition exists for a reason this article returns to at the end.
Escalate, don't guess
The gatekeeper does not interpolate between proceed and escalate. It escalates on a defined list and only that list, so the line stays predictable.
It escalates when a critic's blocker got a fix that looks thin or wrong. When a scope decision materially changes cost or timeline. When two equally valid approaches exist and the call is a real tradeoff that belongs to the operator. When the work touches anything sensitive: a production database, billing, auth. On any production or public-shipping action, anything destructive or hard to reverse, or any change to secrets or access. When a blocker stays unresolved, a revision cap is exhausted, or two specialists conflict and the written evidence cannot settle it. And when the orchestrator is on a path that is spec-compliant but violates doctrine: over-engineering, or building machinery a convention already covers.
The default failure direction is the whole point. A missing verdict, or one that timed out, escalates. It never auto-proceeds. "Ask the human" is the default and "guess and continue" is forbidden, because a system that proceeds on silence is a system that ships on silence. Revise loops cap at two by default, and hitting the cap is itself an automatic escalate.
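The fail-closed rule fits in one small function. This is a sketch under assumptions: next_action and its parameters are hypothetical names, None stands in for a missing or timed-out verdict, and revise_count is taken to mean the number of revise rounds already spent at this checkpoint.

```python
from typing import Optional

DECISIONS = {"proceed", "revise", "escalate"}
REVISE_CAP = 2  # default cap on revise loops


def next_action(decision: Optional[str], revise_count: int,
                revise_cap: int = REVISE_CAP) -> str:
    """Map a raw gatekeeper result onto what the pipeline does next."""
    # Missing, timed-out, or malformed verdicts never auto-proceed.
    if decision not in DECISIONS:
        return "escalate"
    # Hitting the revise cap is itself an escalation.
    if decision == "revise" and revise_count >= revise_cap:
        return "escalate"
    return decision
```

Every branch that is not an explicit, valid verdict falls through to "escalate". There is no path on which silence turns into "proceed".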
The plumbing that makes it cheap
A human-replacing reviewer sounds expensive, and the wait sounds most expensive of all: a long-running model session parked at a checkpoint, burning tokens while it idles. It is not, because the coordination mechanism is a file mailbox and nothing more.
When the orchestrator reaches a checkpoint, it writes a request file (the artifact path, its hash, the question, the checkpoint id, the attempt, the revise count), fires a background file-watch script, and ends its turn. It is now spending nothing. A watcher process polls for request files with no matching verdict; when it finds one, it spawns the gatekeeper headless, one shot. A shell bridge validates the structured output, writes the verdict atomically by renaming a temp file into place, and appends it to a ledger. The watcher exits the instant the verdict appears, fires one completion notification, and re-invokes the orchestrator. No broker. No message queue. No shared memory. No parked session.
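The design describes a shell bridge; the same two moves are sketched here in Python for illustration only. The file-naming scheme (request_id.request.json and request_id.verdict.json) and both function names are assumptions. The sketch shows the watcher's "request with no matching verdict" test and the temp-file-then-rename write.

```python
import json
import os
import tempfile
from pathlib import Path


def pending_requests(mailbox: Path) -> list[str]:
    """Return ids of request files that have no matching verdict yet."""
    ids = []
    for req in sorted(mailbox.glob("*.request.json")):
        request_id = req.name[: -len(".request.json")]
        if not (mailbox / f"{request_id}.verdict.json").exists():
            ids.append(request_id)
    return ids


def write_verdict_atomically(mailbox: Path, request_id: str,
                             verdict: dict) -> Path:
    """Write to a temp file in the same directory, then rename into place."""
    final = mailbox / f"{request_id}.verdict.json"
    # Same directory means same filesystem, so the rename below is atomic.
    fd, tmp = tempfile.mkstemp(dir=mailbox, prefix=".tmp-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(verdict, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # A reader sees either no verdict file or a complete one, never a partial.
    os.replace(tmp, final)
    return final
```

Because the verdict appears in a single rename, the watcher's existence check doubles as a completeness check, which is what lets the mailbox work without a broker or a lock.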
The ledger is append-only. Every verdict lands with its checkpoint id, attempt, timestamp, decision type, uncertainties, rationale, a borderline flag, and references, plus an "operator's call" field filled in later if an escalation resolves into a human decision. The bridge never edits a prior entry. At the end of a run, a summary groups decisions by type and surfaces every escalation and borderline call, so the operator can see in one place where the run actually needed them.
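One way to picture the ledger and its end-of-run summary, as a sketch only: the JSON Lines file format, the function names, and the entry keys used here (decision, checkpoint_id, borderline) are assumptions. It does not model the later "operator's call" field.

```python
import json
from collections import Counter
from pathlib import Path


def append_entry(ledger: Path, entry: dict) -> None:
    """Append one verdict as a single JSON line; prior lines are never touched."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def summarize(ledger: Path) -> dict:
    """Group decisions by type and list where the run needed the operator."""
    lines = ledger.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    return {
        "by_decision": dict(Counter(e["decision"] for e in entries)),
        "needs_operator": [
            e["checkpoint_id"] for e in entries
            if e["decision"] == "escalate" or e.get("borderline")
        ],
    }
```

Opening the file in append mode is what keeps the history honest here: the only write the code can perform is adding a new line at the end.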
Where it breaks
The design has real limits.
The gatekeeper is only as good as its checklist. A criterion the checklist does not cover is a criterion the reviewer will not catch. The checklist is not magic; it is a maintained artifact written by people, and it drifts as the work changes underneath it. A gate is exactly as sharp as the list behind it, and that list needs tending.
The boundary needs active defense. The pressure to make the gatekeeper smarter is constant, because a reviewer that knew the operator's taste really would clear more checkpoints without a human. That is the appeal and the trap. Any checklist evolution that introduces preference signals (past operator decisions, approval patterns, inferred taste) crosses from gating into judgment. The reason "boundary violation" is a defined term is that someone will propose crossing it, in good faith, every few months.
An automated approver can also rubber-stamp. If the checklist is under-specified or the context is stale, the gatekeeper can consistently proceed on things it should have escalated. The hash binding stops a stale verdict from carrying forward onto a changed file, but it does nothing to stop a fresh bad verdict on a fresh artifact. A confident wrong "proceed" on the artifact in front of it is still a wrong proceed.
So it does not run fully unattended. It stays attended until a self-audit gate reads clean: a cold auditor blind-samples the gatekeeper's "proceed" verdicts, and any mismatch becomes a calibration signal surfaced to the operator. That discipline earns its place precisely because the rubber-stamp failure is real.
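The blind sample can be sketched as follows. The names, the entry keys, and the sample size are hypothetical, and the design does not say how the auditor's result is recorded. Here it is assumed to be a mapping from (checkpoint id, attempt) to the auditor's own decision.

```python
import random
from typing import Optional


def sample_for_audit(entries: list[dict], k: int,
                     seed: Optional[int] = None) -> list[dict]:
    """Pick up to k 'proceed' verdicts and strip everything the auditor
    should not see: the original decision, rationale, and uncertainties."""
    proceeds = [e for e in entries if e["decision"] == "proceed"]
    rng = random.Random(seed)
    picked = rng.sample(proceeds, min(k, len(proceeds)))
    return [
        {"checkpoint_id": e["checkpoint_id"],
         "attempt": e["attempt"],
         "artifact_hash": e["artifact_hash"]}
        for e in picked
    ]


def calibration_signals(
        audit_decisions: dict[tuple[str, int], str]) -> list[tuple[str, int]]:
    """Every sampled 'proceed' that the cold auditor did not also pass."""
    return sorted(key for key, decision in audit_decisions.items()
                  if decision != "proceed")
```

The auditor receives only the pointer to the artifact, never the gatekeeper's reasoning, so agreement between the two is evidence and not an echo.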
What it is not allowed to decide
Gate the flow. Leave the judgment to the human, and when the human is not there, escalate until they are.
A reviewer that stays in its lane is trustworthy: its "proceed" means a checklist passed against a known artifact, and nothing more, which is exactly why you can rely on it. A reviewer that starts inferring preferences is neither a gatekeeper nor a good stand-in for the operator. It is a worse version of both.
The thing this buys is specific. You hand a task brief to the framework and you are not pulled back in unless a real escalation fires, and when one does, you trust it fired for a reason on the list, not because a model guessed you might want to look. That is not bought by making the reviewer smarter. It is bought by keeping its scope narrow and its escalation signals honest, and by defending, every time the pressure comes back, the small set of decisions it is not allowed to make.