The default mental model for AI-assisted work is the agent loop. Give the model a goal, a set of tools, and a system prompt. It decides what to do next based on what just happened. The control flow belongs to the model.
The inversion is simple. You define the steps in code. Each step invokes the model with a tightly scoped prompt and a tightly scoped tool boundary. The output of one step becomes the input of the next, validated by code, not by another model call. The control flow belongs to the engineer.
That trade is not subtle.
| Dimension | Agent loop | Workflow |
|---|---|---|
| Control flow | Model decides | Code decides |
| Token cost | High and variable | Bounded and predictable |
| Debuggability | Trace the model's reasoning | Read the code |
| Security surface | Whatever the model picks up | Whatever you allowlist |
| Failure modes | Discovered at runtime | Tested at build time |
| Best for | Open-ended exploration | Knowable, repeatable work |
Most real engineering work sits closer to the second column than the first. We keep reaching for the agent loop because it is what the tools default to, not because it is what the work needs.
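As a minimal sketch of the workflow column, the control flow can be as small as a loop over steps, each with a prompt builder and a deterministic validator. Everything here is illustrative: `callModel`, `runPipeline`, `demoSteps`, and `stubModel` are hypothetical names, and the model call is stubbed so the control flow can run offline.

```typescript
// Sketch only: `callModel` is a hypothetical stand-in for a real SDK call.
type Validation = { ok: true } | { ok: false; reason: string };

type Step = {
  name: string;
  // Builds a tightly scoped prompt from the previous step's output.
  buildPrompt: (input: string) => string;
  // Deterministic gate on the model's output; no second model call.
  validate: (output: string) => Validation;
};

type ModelCall = (prompt: string) => Promise<string>;

async function runPipeline(
  steps: Step[],
  initialInput: string,
  callModel: ModelCall,
): Promise<string> {
  let current = initialInput;
  for (const step of steps) {
    const output = await callModel(step.buildPrompt(current));
    const check = step.validate(output);
    if (!check.ok) {
      // Fail closed: a failed gate stops the run instead of passing bad input along.
      throw new Error(`step '${step.name}' failed validation: ${check.reason}`);
    }
    current = output;
  }
  return current;
}

// Usage with a stubbed model, so the sequence can be exercised without an API key.
const demoSteps: Step[] = [
  {
    name: "plan",
    buildPrompt: (spec) => `Plan the change for: ${spec}`,
    validate: (out) => (out.trim() ? { ok: true } : { ok: false, reason: "empty plan" }),
  },
  {
    name: "implement",
    buildPrompt: (plan) => `Implement this plan: ${plan}`,
    validate: (out) =>
      out.startsWith("changed:") ? { ok: true } : { ok: false, reason: "no change manifest" },
  },
];

const stubModel: ModelCall = async (prompt) =>
  prompt.startsWith("Plan") ? "modify src/slugify.ts" : "changed: src/slugify.ts";
```

The model never chooses the next step. It only fills in the output that the next gate inspects.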
The honest answer to "should I build this today or wait?" is this: build it today if you have one workflow that already runs by hand and you want to make it inspectable, governable, and cheap. Wait if you are looking for a polished framework. Anthropic appears to have a Workflow tool in flight (a changelog line for 2.1.147 referenced it before being pulled), and when it ships, the official version will almost certainly be better than what I show below. The pattern, though, will be the same. Building the rough version now is how you learn what to ask of the official one when it lands.
I have written before about the spec, standards, specialists loop and the Context Spine as the durable artifact that travels through it. Both posts were about the pattern. This one is about the code, and the day I caught the workflow doing something an agent loop would have missed.
The moment the inversion clicked
Here is the representative failure mode that made the pattern click. The spec was small: a slugify function with four acceptance criteria and a coverage threshold of 0.85.
The plan step produced a clean file list. The implement step compiled and linted. The test step passed. Coverage cleared the threshold. Three for three. That is exactly where this kind of work gets dangerous.
The review step blocked.
It mapped each acceptance criterion to evidence in the diff and found one criterion with no asserting test: "collapse consecutive hyphens." The implementation handled it correctly, the path executed in another test as a side effect, and line coverage looked healthy. The acceptance criterion was satisfied in production and unverified in CI.
A coverage-only completion check would have shipped that. Tests were green. Coverage was high enough. The workflow asked a different question: does every acceptance criterion have evidence? It did not. The validator failed closed.
That is the inversion in one paragraph. The model still did the judgment work. The code decided what counted as done.
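The gap between the two questions can be shown in a few lines. This is a sketch, not the review step itself. It assumes each test declares which criteria it asserts, which is a hypothetical convention. Only "collapse consecutive hyphens" comes from the story above. The other three criteria and the 0.9 coverage figure are made up for illustration.

```typescript
// Each test declares which acceptance criteria it explicitly asserts (hypothetical convention).
type TestCase = { name: string; asserts: string[] };

// Coverage-only gate: passes whenever the number clears the threshold.
function coverageGate(coverage: number, minCoverage: number): boolean {
  return coverage >= minCoverage;
}

// Evidence gate: returns every criterion that no test asserts.
function unverifiedCriteria(criteria: string[], tests: TestCase[]): string[] {
  const asserted = new Set<string>();
  for (const test of tests) {
    for (const criterion of test.asserts) asserted.add(criterion);
  }
  return criteria.filter((criterion) => !asserted.has(criterion));
}

// The first three criteria are invented for this example.
const slugifyCriteria = [
  "lowercase the input",
  "replace spaces with hyphens",
  "strip punctuation",
  "collapse consecutive hyphens",
];

const slugifyTests: TestCase[] = [
  { name: "lowercases input", asserts: ["lowercase the input"] },
  { name: "replaces spaces", asserts: ["replace spaces with hyphens"] },
  // Runs the hyphen-collapsing path as a side effect but never asserts on it.
  { name: "strips punctuation", asserts: ["strip punctuation"] },
];

const coveragePasses = coverageGate(0.9, 0.85); // 0.9 is a made-up measurement
const evidenceGaps = unverifiedCriteria(slugifyCriteria, slugifyTests);
```

Here `coveragePasses` is true while `evidenceGaps` still holds one criterion. That is the situation the review step blocked on.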
The Context Spine, in code
The Spec Is the Product made the case for a versioned specification as the source of truth. The companion idea is the Context Spine: the structured artifact that carries the spec, the plan, the implementation notes, and the review findings through every stage of the pipeline.
In an agent loop, the spine lives in the model's context window. It gets compressed, paraphrased, and occasionally forgotten as the conversation grows. In a workflow, the spine lives on disk and in code. Each step reads it, enriches it, and writes it back. No paraphrase. No forgetting.
That is the design goal of the file below.
What the workflow does
Four steps. One spec in. One result out.
- Plan: Read the spec. Explore the codebase. Produce a JSON plan of files to touch, the approach, and known risks. Read-only tools.
- Implement: Take the plan. Write the code. Validate by running the build and the linter. Fail if implementation touches files outside the plan.
- Test: Write tests that map to each acceptance criterion. Run the suite. Validate coverage against the spec's minimum.
- Review: Read the diff. Map every acceptance criterion to evidence in the code. Block on uncovered criteria or blocking findings.
Every step has a deterministic gate. Build passes. Lint passes. Coverage clears the threshold. The review step asks the model to map each acceptance criterion to concrete evidence, then the validator checks that every cited file and line range exists. The model proposes the evidence. Code decides whether the step passed.
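One gate deserves a caveat. In the file later in this post, the test step compares the coverage number from the model's JSON reply against the spec minimum. A stricter variant would read the number from the coverage tool's own report. The sketch below assumes an Istanbul-style json-summary report, where `total.lines.pct` is a percentage from 0 to 100. `measuredCoverageGate` is a hypothetical helper and is not part of workflow.ts.

```typescript
type Validation = { ok: true } | { ok: false; reason: string };

// Only the part of an Istanbul-style json-summary report that this gate reads.
type CoverageSummary = { total?: { lines?: { pct?: unknown } } } | null;

function measuredCoverageGate(summaryJson: string, minCoverage: number): Validation {
  let summary: CoverageSummary;
  try {
    summary = JSON.parse(summaryJson) as CoverageSummary;
  } catch {
    return { ok: false, reason: "coverage summary is not valid JSON" };
  }
  const pct = summary?.total?.lines?.pct;
  if (typeof pct !== "number" || !Number.isFinite(pct)) {
    // Missing or non-numeric values fail closed rather than counting as zero or passing.
    return { ok: false, reason: "coverage summary has no numeric total.lines.pct" };
  }
  const ratio = pct / 100; // the report uses 0-100, the spec uses 0-1
  return ratio >= minCoverage
    ? { ok: true }
    : { ok: false, reason: `measured line coverage ${ratio.toFixed(3)} below minimum ${minCoverage}` };
}
```

In a real run, the string would come from reading the report file the test command writes, typically a coverage-summary.json under the coverage output directory. That requires the json-summary reporter to be enabled in the project's test configuration.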
A note on the SDK before the code
Two details in the Claude Agent SDK are easy to get wrong, and getting them wrong silently breaks the whole pattern. I will name both here so the code below makes sense.
allowedTools is the auto-approval list, not the restriction list. The SDK docs are explicit: passing ["Read", "Grep"] does not mean "the model can only use Read and Grep." It means "Read and Grep are pre-approved without prompting." To get actual restriction, you pair allowedTools with permissionMode: "dontAsk", which denies anything not on the list without prompting. The combination is the real security boundary. The popular acceptEdits mode auto-approves edits but does not restrict which tools the model can reach for. I caught this before publishing, but a less careful version of me would have shipped a workflow that claimed to enforce a boundary it did not enforce.
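One way to avoid forgetting the pairing is to build both options in a single place. The sketch below uses a local `StepPermissions` type as a hypothetical mirror of those two options; it is not imported from the SDK. The tool lists are the ones the four steps in this post use.

```typescript
// Hypothetical local mirror of the two options discussed above. Not imported from the SDK.
type StepPermissions = {
  allowedTools: string[];
  permissionMode: "dontAsk";
};

type StepName = "plan" | "implement" | "test" | "review";

// Single place that pairs the tool list with the restricting mode.
function restrictedTo(tools: string[]): StepPermissions {
  return { allowedTools: [...tools], permissionMode: "dontAsk" };
}

const toolBoundaries: Record<StepName, StepPermissions> = {
  plan: restrictedTo(["Read", "Grep", "Glob"]),
  implement: restrictedTo(["Read", "Edit", "Write", "Bash"]),
  test: restrictedTo(["Read", "Edit", "Write", "Bash"]),
  review: restrictedTo(["Read", "Grep", "Glob", "Bash"]),
};

// Steps whose tool list contains nothing that can write files or run commands.
function readOnlySteps(boundaries: Record<StepName, StepPermissions>): StepName[] {
  const mutating = new Set(["Edit", "Write", "Bash"]);
  return (Object.keys(boundaries) as StepName[]).filter((step) =>
    boundaries[step].allowedTools.every((tool) => !mutating.has(tool)),
  );
}
```

Only the plan step comes back from `readOnlySteps`. Any step that is allowed Bash can still change files through the shell, so the allowlist bounds which tools are reachable, not what a shell command does once it runs.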
Tool errors arrive in user messages, not assistant messages. The SDK runs an internal loop. When the model emits a tool_use block in an assistant message, the SDK executes the tool and injects the result back into the stream as a tool_result block on the next user message. The is_error flag lives on tool_result. Watching tool_use blocks for errors is a silent no-op.
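To make the message flow concrete, here is a sketch over simplified message shapes. The `StreamMessage` and `ContentBlock` types are hypothetical reductions of the real SDK types, which carry more fields. The function also maps each failing `tool_use_id` back to the tool name from the earlier assistant message, so the error names the tool rather than an opaque id.

```typescript
// Simplified, hypothetical message shapes. The real SDK types carry more fields.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; is_error?: boolean };

type StreamMessage =
  | { type: "assistant"; content: ContentBlock[] }
  | { type: "user"; content: ContentBlock[] | string };

type ToolFailure = { toolUseId: string; toolName: string };

function failedToolCalls(messages: StreamMessage[]): ToolFailure[] {
  const namesById = new Map<string, string>();
  const failures: ToolFailure[] = [];
  for (const msg of messages) {
    if (typeof msg.content === "string") continue;
    for (const block of msg.content) {
      if (msg.type === "assistant" && block.type === "tool_use") {
        // Remember which tool each call id belongs to.
        namesById.set(block.id, block.name);
      }
      if (msg.type === "user" && block.type === "tool_result" && block.is_error === true) {
        // The error flag lives on the result block, inside the user message.
        failures.push({
          toolUseId: block.tool_use_id,
          toolName: namesById.get(block.tool_use_id) ?? "unknown",
        });
      }
    }
  }
  return failures;
}

// A made-up transcript: one failing Bash call, one successful Read call.
const transcript: StreamMessage[] = [
  { type: "assistant", content: [{ type: "tool_use", id: "t1", name: "Bash" }] },
  { type: "user", content: [{ type: "tool_result", tool_use_id: "t1", is_error: true }] },
  { type: "assistant", content: [{ type: "tool_use", id: "t2", name: "Read" }] },
  { type: "user", content: [{ type: "tool_result", tool_use_id: "t2" }] },
];
```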
Both details are wired correctly in the code below. If the official Workflow tool ships and handles this for you, you can stop thinking about either one.
The file
This is workflow.ts. About 390 lines, with dependencies kept to @anthropic-ai/claude-agent-sdk@0.3.150 and Zod. I type-checked this version locally on Node 22 with TypeScript strict mode and skipLibCheck enabled. The skipLibCheck caveat matters: the SDK package's own declaration files currently surface unresolved internal type references when library checking is enabled. The point is not to ship a framework. The point is to show what the inversion looks like when you can read the whole thing in one sitting.
// workflow.ts
// Context Spine handoff: spec -> plan -> implement -> test -> review.
// Code owns the control flow. The model owns the judgment per step.
//
// Tested against @anthropic-ai/claude-agent-sdk@0.3.150 on Node 22.
// Pin your version. The SDK API has shifted in 2026.
import { query, type SDKMessage } from "@anthropic-ai/claude-agent-sdk";
import { execSync } from "node:child_process";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { z } from "zod";
// Spec contract: the Context Spine artifact
const SpecSchema = z.object({
id: z.string(),
title: z.string(),
intent: z.string(),
acceptanceCriteria: z.array(z.string()).min(1),
affectedPaths: z.array(z.string()),
constraints: z.object({
language: z.enum(["typescript", "python", "go"]),
testFramework: z.string(),
minCoverage: z.number().min(0).max(1).default(0.8),
buildCmd: z.string().default("npm run build"),
lintCmd: z.string().default("npm run lint"),
testCmd: z.string().default("npm test -- --coverage"),
}),
});
type Spec = z.infer<typeof SpecSchema>;
// Step result envelope
type StepResult<T> = {
step: string;
output: T;
tokensUsed: number;
durationMs: number;
};
type ValidationResult = { ok: true } | { ok: false; reason: string };
// Core step runner
async function runStep<T>(opts: {
name: string;
prompt: string;
systemPrompt: string;
allowedTools: string[];
cwd: string;
parse: (raw: string) => T;
validate: (parsed: T) => ValidationResult;
maxRetries?: number;
}): Promise<StepResult<T>> {
const start = Date.now();
let lastError = "";
let tokens = 0;
for (let attempt = 1; attempt <= (opts.maxRetries ?? 2); attempt++) {
const stream = query({
prompt: attempt === 1
? opts.prompt
: `${opts.prompt}\n\nPrevious attempt failed validation: ${lastError}. Fix and retry.`,
options: {
systemPrompt: opts.systemPrompt,
allowedTools: opts.allowedTools,
cwd: opts.cwd,
model: "opus",
// 'dontAsk' is the real restriction mode: anything not in allowedTools
// is denied without prompting. 'acceptEdits' only auto-approves; it does
// NOT restrict which tools the model can reach for.
permissionMode: "dontAsk",
maxTurns: 20,
},
});
let finalText = "";
let toolError: string | null = null;
let sdkError: string | null = null;
for await (const msg of stream as AsyncIterable<SDKMessage>) {
// Tool errors come back as tool_result blocks inside user messages.
if (msg.type === "user") {
const content = msg.message.content;
if (Array.isArray(content)) {
for (const block of content as unknown as Array<Record<string, unknown>>) {
if (block.type === "tool_result" && block.is_error === true) {
toolError = `tool ${String(block.tool_use_id ?? "unknown")} errored`;
}
}
}
}
if (msg.type === "result") {
if (msg.subtype === "success") {
finalText = msg.result;
} else {
sdkError = `SDK returned non-success: ${msg.subtype}`;
}
tokens += (msg.usage?.input_tokens ?? 0) + (msg.usage?.output_tokens ?? 0);
}
}
// Without final text there is nothing to parse. Report the SDK or tool error
// directly so a JSON parse failure does not mask the real cause.
if (!finalText) {
lastError = sdkError ?? toolError ?? "model returned no final text";
continue;
}
try {
const parsed = opts.parse(finalText);
const check = opts.validate(parsed);
if (check.ok) {
return { step: opts.name, output: parsed, tokensUsed: tokens, durationMs: Date.now() - start };
}
lastError = check.reason;
} catch (e) {
lastError = `parse failed: ${(e as Error).message}`;
}
}
throw new Error(`Step '${opts.name}' failed after retries: ${lastError}`);
}
// Shell helpers
function runShell(cmd: string, cwd: string, timeoutMs = 120_000): ValidationResult {
try {
execSync(cmd, { cwd, stdio: "pipe", timeout: timeoutMs });
return { ok: true };
} catch (e) {
const msg = (e as Error).message.slice(0, 1_000);
return { ok: false, reason: `'${cmd}' failed: ${msg}` };
}
}
function runShellText(cmd: string, cwd: string, timeoutMs = 120_000): { ok: true; stdout: string } | { ok: false; reason: string } {
try {
const stdout = execSync(cmd, { cwd, encoding: "utf8", stdio: "pipe", timeout: timeoutMs });
return { ok: true, stdout };
} catch (e) {
const msg = (e as Error).message.slice(0, 1_000);
return { ok: false, reason: `'${cmd}' failed: ${msg}` };
}
}
function lines(stdout: string): string[] {
return stdout.split("\n").map(line => line.trim()).filter(Boolean);
}
function changedFiles(cwd: string): { ok: true; files: string[] } | { ok: false; reason: string } {
// Compare against HEAD so staged edits and deletions also count as touched files.
const tracked = runShellText("git diff --name-only HEAD", cwd);
if (!tracked.ok) return tracked;
const untracked = runShellText("git ls-files --others --exclude-standard", cwd);
if (!untracked.ok) return untracked;
return { ok: true, files: [...new Set([...lines(tracked.stdout), ...lines(untracked.stdout)])] };
}
// Step 1: Plan
const PlanSchema = z.object({
files: z.array(z.object({
path: z.string(),
action: z.enum(["create", "modify", "delete"]),
purpose: z.string(),
})),
approach: z.string(),
risks: z.array(z.string()),
});
type Plan = z.infer<typeof PlanSchema>;
async function planStep(spec: Spec, cwd: string) {
return runStep({
name: "plan",
cwd,
allowedTools: ["Read", "Grep", "Glob"],
systemPrompt: "You are a senior engineer producing implementation plans. Output ONLY valid JSON. No prose.",
prompt: `
Spec:
${JSON.stringify(spec, null, 2)}
Explore the codebase at ${cwd}. Produce a JSON plan:
{ "files": [{"path": "...", "action": "create|modify|delete", "purpose": "..."}],
"approach": "2-3 sentences",
"risks": ["..."] }
`.trim(),
parse: (raw) => PlanSchema.parse(JSON.parse(extractJson(raw))),
validate: (plan) => {
if (plan.files.length === 0) return { ok: false, reason: "empty file list" };
const planned = new Set(plan.files.map(f => f.path));
const covered = spec.affectedPaths.every(p => planned.has(p));
return covered ? { ok: true } : { ok: false, reason: "plan missing files from spec.affectedPaths" };
},
});
}
// Step 2: Implement
const ImplementSchema = z.object({
filesChanged: z.array(z.string()),
});
async function implementStep(spec: Spec, plan: Plan, cwd: string) {
return runStep({
name: "implement",
cwd,
allowedTools: ["Read", "Edit", "Write", "Bash"],
systemPrompt: "Implement the planned change. Touch only files in the plan. Match existing code style.",
prompt: `
Spec: ${spec.title}
Intent: ${spec.intent}
Plan:
${JSON.stringify(plan, null, 2)}
When done, output: {"filesChanged": ["path1", "path2"]}
`.trim(),
parse: (raw) => ImplementSchema.parse(JSON.parse(extractJson(raw))),
validate: (result) => {
const build = runShell(spec.constraints.buildCmd, cwd);
if (!build.ok) return build;
const lint = runShell(spec.constraints.lintCmd, cwd);
if (!lint.ok) return lint;
const actual = changedFiles(cwd);
if (!actual.ok) return actual;
const planned = new Set(plan.files.map(f => f.path));
const stray = actual.files.filter(f => !planned.has(f));
if (stray.length) {
return { ok: false, reason: `actual diff touched files outside plan: ${stray.join(", ")}` };
}
const reported = new Set(result.filesChanged);
const missingFromManifest = actual.files.filter(f => !reported.has(f));
if (missingFromManifest.length) {
return { ok: false, reason: `manifest omitted changed files: ${missingFromManifest.join(", ")}` };
}
return { ok: true };
},
});
}
// Step 3: Test
const TestSchema = z.object({ coverage: z.number().min(0).max(1) });
async function testStep(spec: Spec, cwd: string) {
return runStep({
name: "test",
cwd,
allowedTools: ["Read", "Edit", "Write", "Bash"],
systemPrompt: "Write tests covering every acceptance criterion. Use the project's existing test framework.",
prompt: `
Spec:
${JSON.stringify(spec, null, 2)}
Each acceptance criterion needs at least one test. Run the suite. Output: {"coverage": 0.NN}
`.trim(),
parse: (raw) => TestSchema.parse(JSON.parse(extractJson(raw))),
validate: (result) => {
const tests = runShell(spec.constraints.testCmd, cwd);
if (!tests.ok) return tests;
if (result.coverage < spec.constraints.minCoverage) {
return { ok: false, reason: `coverage ${result.coverage} below minimum ${spec.constraints.minCoverage}` };
}
return { ok: true };
},
});
}
// Step 4: Review
const EvidenceSchema = z.object({
file: z.string(),
lineStart: z.number().int().positive(),
lineEnd: z.number().int().positive(),
note: z.string(),
});
const ReviewSchema = z.object({
verdict: z.enum(["approve", "request_changes", "block"]),
findings: z.array(z.object({
severity: z.enum(["info", "warn", "blocker"]),
file: z.string(),
note: z.string(),
})),
acceptanceCoverage: z.array(z.object({
criterion: z.string(),
covered: z.boolean(),
evidence: z.array(EvidenceSchema).min(1),
})),
});
function validateEvidenceLocations(review: z.infer<typeof ReviewSchema>, cwd: string): ValidationResult {
for (const item of review.acceptanceCoverage) {
for (const evidence of item.evidence) {
const path = resolve(cwd, evidence.file);
if (!existsSync(path)) {
return { ok: false, reason: `evidence file does not exist: ${evidence.file}` };
}
if (evidence.lineEnd < evidence.lineStart) {
return { ok: false, reason: `invalid evidence range: ${evidence.file}:${evidence.lineStart}-${evidence.lineEnd}` };
}
const lineCount = readFileSync(path, "utf8").split("\n").length;
if (evidence.lineEnd > lineCount) {
return { ok: false, reason: `evidence range exceeds file length: ${evidence.file}:${evidence.lineStart}-${evidence.lineEnd}` };
}
}
}
return { ok: true };
}
async function reviewStep(spec: Spec, cwd: string) {
return runStep({
name: "review",
cwd,
allowedTools: ["Read", "Grep", "Glob", "Bash"],
systemPrompt: "Map every acceptance criterion to concrete evidence in the diff. Block on security or correctness issues.",
prompt: `
Spec:
${JSON.stringify(spec, null, 2)}
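Run \`git diff\`. For each acceptance criterion, cite the file and line range that proves it.
Output ONLY JSON in this shape:
{ "verdict": "approve|request_changes|block",
"findings": [{"severity": "info|warn|blocker", "file": "...", "note": "..."}],
"acceptanceCoverage": [{"criterion": "exact criterion text from the spec", "covered": true,
"evidence": [{"file": "...", "lineStart": 1, "lineEnd": 1, "note": "..."}]}] }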
`.trim(),
parse: (raw) => ReviewSchema.parse(JSON.parse(extractJson(raw))),
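validate: (review) => {
// Every criterion in the spec must appear in the review. An omitted criterion fails closed.
const reviewed = new Set(review.acceptanceCoverage.map(c => c.criterion));
const missing = spec.acceptanceCriteria.filter(c => !reviewed.has(c));
if (missing.length) return { ok: false, reason: `criteria missing from review (quote them exactly): ${missing.join("; ")}` };
const uncovered = review.acceptanceCoverage.filter(c => !c.covered);
if (uncovered.length) return { ok: false, reason: `uncovered: ${uncovered.map(c => c.criterion).join("; ")}` };
const evidence = validateEvidenceLocations(review, cwd);
if (!evidence.ok) return evidence;
const blockers = review.findings.filter(f => f.severity === "blocker");
if (blockers.length) return { ok: false, reason: `${blockers.length} blocking findings` };
return { ok: true };
},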
});
}
// Workflow orchestrator
export async function contextSpineWorkflow(specPath: string, cwd: string) {
const spec = SpecSchema.parse(JSON.parse(readFileSync(specPath, "utf8")));
const plan = await planStep(spec, cwd);
const impl = await implementStep(spec, plan.output, cwd);
const tests = await testStep(spec, cwd);
const review = await reviewStep(spec, cwd);
const summary = {
spec: spec.id,
plan: plan.output,
implementation: impl.output,
tests: tests.output,
review: review.output,
totalTokens: plan.tokensUsed + impl.tokensUsed + tests.tokensUsed + review.tokensUsed,
totalMs: plan.durationMs + impl.durationMs + tests.durationMs + review.durationMs,
};
mkdirSync(`${cwd}/.spine`, { recursive: true });
writeFileSync(`${cwd}/.spine/${spec.id}-result.json`, JSON.stringify(summary, null, 2));
return summary;
}
function extractJson(raw: string): string {
const match = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
return (match ? match[1] : raw).trim();
}
async function main() {
const [, , specPath, repoPath] = process.argv;
if (!specPath || !repoPath) {
console.error("Usage: tsx workflow.ts <spec.json> <repo-path>");
process.exitCode = 1;
return;
}
const result = await contextSpineWorkflow(specPath, repoPath);
console.log(JSON.stringify(result, null, 2));
}
const isDirectRun = process.argv[1] === fileURLToPath(import.meta.url);
if (isDirectRun) {
main().catch((error: unknown) => {
console.error(error instanceof Error ? error.message : String(error));
process.exitCode = 1;
});
}
The spec, on disk
Every step reads from this file. It is the Context Spine. The MFA content below is an example contract, not a claim about a live endpoint.
{
"id": "AUTH-142",
"title": "Add MFA enrollment endpoint",
"intent": "Users can enroll in TOTP-based MFA from account settings",
"acceptanceCriteria": [
"POST /auth/mfa/enroll returns a TOTP secret and QR code URI",
"Secret is encrypted at rest in the user record",
"Endpoint requires authenticated session",
"Replay attempts are rejected"
],
"affectedPaths": ["src/auth/mfa.ts", "src/auth/mfa.test.ts"],
"constraints": {
"language": "typescript",
"testFramework": "vitest",
"minCoverage": 0.85,
"buildCmd": "npm run build",
"lintCmd": "npm run lint",
"testCmd": "npm test -- --coverage"
}
}
That file is the artifact you version. It is the artifact you review. It is the artifact a future engineer or a future agent reads to understand what the system was supposed to do. The code that comes out of the workflow is the regenerable output. The spec is the durable input.
This is the practical answer to the question I get from leaders thinking about this pattern: "What do we actually own when the AI writes the code?" You own the spec. You own the standards. You own the workflow. The code is what falls out.
Running it
npm install @anthropic-ai/claude-agent-sdk@0.3.150 zod
npm install -D tsx typescript @types/node
npm pkg set type=module
export ANTHROPIC_API_KEY=...
npx tsx workflow.ts ./specs/AUTH-142.json ./repo
To type-check the file the same way I did:
npx tsc --noEmit --strict --skipLibCheck --module NodeNext --moduleResolution NodeNext --target ES2022 --types node workflow.ts
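The result lands in .spine/AUTH-142-result.json inside the target repo: plan, implementation manifest, test coverage, review verdict, total tokens, total wall clock. The whole workflow becomes inspectable. Add .spine/ to that repo's .gitignore, though. The implement gate counts untracked files as changes, so a result file left over from an earlier run would otherwise be flagged as a change outside the plan.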
Where I would push back on my own design
Two things I would flag if I were reviewing this in a code review.
The retry loop is still naive: Two attempts by default. The truncated stderr helps, but a production version would bound retries with progressively narrower scope and fall back to opening a ticket for human review instead of throwing. The first version of any workflow should fail hard. The second version should fail gracefully. I am on version one.
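A rough sketch of the graceful half: return a result the caller can route instead of throwing. `runWithEscalation` and the `needs_human` status are hypothetical names. Opening the ticket is left to the caller, and narrowing scope between attempts is not shown.

```typescript
type Validation = { ok: true } | { ok: false; reason: string };

// `feedback` is the previous failure reason, or null on the first attempt.
type Attempt<T> = (feedback: string | null) => Promise<T>;

type Outcome<T> =
  | { status: "passed"; output: T; attempts: number }
  | { status: "needs_human"; reasons: string[] };

async function runWithEscalation<T>(
  attempt: Attempt<T>,
  validate: (output: T) => Validation,
  maxAttempts = 2,
): Promise<Outcome<T>> {
  const reasons: string[] = [];
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      const output = await attempt(reasons[reasons.length - 1] ?? null);
      const check = validate(output);
      if (check.ok) return { status: "passed", output, attempts: i };
      reasons.push(check.reason);
    } catch (e) {
      // A thrown error counts as a failed attempt, not a crash.
      reasons.push(e instanceof Error ? e.message : String(e));
    }
  }
  // Out of attempts: hand the full failure history to a person.
  return { status: "needs_human", reasons };
}

// Stubbed attempt: the first try is wrong, the retry with feedback is right.
const flakyAttempt: Attempt<string> = async (feedback) => (feedback === null ? "draft" : "fixed");
const mustBeFixed = (out: string): Validation =>
  out === "fixed" ? { ok: true } : { ok: false, reason: "output was not fixed" };
```

The caller switches on `status`. A `needs_human` outcome carries every failure reason, which is the raw material for the ticket.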
There is no checkpoint: If the implement step fails after the plan succeeded, the plan is lost. A production version writes the plan to disk before the implement step starts, so a re-run can resume from the last successful checkpoint. For a single-engineer use case, restarting is cheap. For a CI pipeline, it is not.
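A minimal sketch of that checkpoint, assuming one JSON file per step keyed by spec id. `resumeOrRun` and `checkpointPath` are hypothetical helpers. A real version would probably also key on a hash of the spec so an edited spec invalidates old checkpoints; that part is not shown.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function checkpointPath(dir: string, specId: string, step: string): string {
  return join(dir, `${specId}-${step}.json`);
}

// Returns the saved output if the step already succeeded, otherwise runs it and saves the result.
async function resumeOrRun<T>(
  dir: string,
  specId: string,
  step: string,
  run: () => Promise<T>,
): Promise<T> {
  const file = checkpointPath(dir, specId, step);
  if (existsSync(file)) {
    return JSON.parse(readFileSync(file, "utf8")) as T;
  }
  const output = await run(); // a throw here leaves no checkpoint behind
  mkdirSync(dir, { recursive: true });
  writeFileSync(file, JSON.stringify(output, null, 2));
  return output;
}

// Usage with a stubbed plan step; a temp directory stands in for the checkpoint folder.
const checkpointDir = mkdtempSync(join(tmpdir(), "spine-"));
let planCalls = 0;
const fakePlan = async () => {
  planCalls += 1;
  return { files: ["src/slugify.ts"] };
};
```

Calling `resumeOrRun(checkpointDir, "DEMO-1", "plan", fakePlan)` twice invokes `fakePlan` once. The second call reads the plan back from disk.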
Both are solvable. Both are the kind of engineering additions that turn a prototype into something you would run inside a regulated engineering organization.
Where the workflow does not belong
This pattern earns its keep on knowable, repeatable work. It does not belong on open-ended exploration.
If I am debugging a production incident, I want the agent loop. The path is not known. I need the model to decide what to look at next based on what just happened. A workflow with predefined steps would slow me down.
If I am implementing a spec with clear acceptance criteria, I want the workflow. The path is known. I want code to enforce the boundaries and validators to fail closed when something drifts.
Both patterns belong in the same toolkit. The mistake is treating either one as the default for all engineering work.
What changes when the official Workflow tool ships
If Anthropic ships the Workflow tool hinted at in the removed changelog line, it should be better than what I built. When an official version lands, three things change.
The orchestrator code in this post becomes easier to replace. An official tool should handle more of the agent loop, structured outputs, and retry semantics natively. What stays relevant is the Zod-validated spec, the deterministic validators, and the tool-scoping discipline. Those are the parts you would have to design either way.
The integration surface gets cleaner. I would expect an official tool to hook into the Claude Code permission model, the existing subagent infrastructure, and the session resume machinery. A homegrown workflow.ts does none of that. The work you do today to define your spec schema and your validators carries forward. The orchestration layer underneath does not.
The pattern gets easier to defend with platform and security partners. Saying "I use the official tool with a custom spec schema" is a much cleaner conversation than "I rolled my own orchestrator." If you are inside a regulated organization, waiting for the official abstraction may be worth it.
If you are not, build the rough version now. It is the fastest way to learn what you want from the polished one.
What changes Monday
Five steps in order.
- Pick one workflow that already runs by hand: A spec to PR loop. A weekly FinOps review. A compliance evidence gather. Something repeatable.
- Write the spec contract in Zod: What is the input shape? What is the output shape? What are the deterministic validators between steps? If you cannot answer those questions, the work is not ready for a workflow yet.
- Write the smallest possible orchestrator: Two steps, not four. One validator each. Run it once by hand.
- Add a third step only when the absence of one hurts: The same rule applies to specialist agents. Build the role the moment you catch yourself wishing for one.
- Log the spine: Every run writes a result file with the spec, the steps, and the validator outcomes. That file is the audit trail and the postmortem.
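For the last item, a sketch of what a per-validator audit record might look like. The `RunLog` and `StepRecord` shapes are hypothetical and go further than the summary file that workflow.ts writes, which stores step outputs but not individual validator outcomes.

```typescript
type Validation = { ok: true } | { ok: false; reason: string };

type StepRecord = {
  step: string;
  attempt: number;
  validator: string; // which gate ran, e.g. "build" or "lint"
  outcome: Validation;
  at: string; // ISO timestamp
};

type RunLog = { specId: string; records: StepRecord[] };

// Appends one validator outcome and returns a new log; the input log is not mutated.
function recordOutcome(log: RunLog, entry: Omit<StepRecord, "at">, now: Date = new Date()): RunLog {
  return { ...log, records: [...log.records, { ...entry, at: now.toISOString() }] };
}

// A run passes only if it has records and the latest outcome of every step/validator pair is ok.
function runPassed(log: RunLog): boolean {
  const latest: Record<string, boolean> = {};
  for (const r of log.records) latest[`${r.step}/${r.validator}`] = r.outcome.ok;
  const keys = Object.keys(latest);
  return keys.length > 0 && keys.every((k) => latest[k] === true);
}

// Made-up run: lint fails on the first attempt and passes on the retry.
let demoLog: RunLog = { specId: "DEMO-1", records: [] };
demoLog = recordOutcome(demoLog, {
  step: "implement",
  attempt: 1,
  validator: "lint",
  outcome: { ok: false, reason: "unused import" },
});
demoLog = recordOutcome(demoLog, {
  step: "implement",
  attempt: 2,
  validator: "lint",
  outcome: { ok: true },
});
```

Serialized with `JSON.stringify`, the log keeps the failed first attempt next to the passing retry. That history is what a postmortem needs.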
The agent loop is a great default for personal productivity. It is a terrible default for production engineering. The workflow inversion is how the same model becomes governable, predictable, and cheap enough to run inside the systems we already trust.
The model still did the judgment work. The code decided what counted as done.
That distinction is what I will be defending in every conversation about this pattern for the next year.
The full working code, the example spec, and the type-check configuration live in the companion repo: github.com/vscarpenter/context-spine-workflow-js. Clone it, run it, and break it. That is the fastest way to feel the inversion for yourself.
