The Frame Is the Bottleneck

My thesis is that as agents get better at execution, the bottleneck moves upstream to framing.

Last week, Claude Code handed me a credible-looking refactor for a service I had been refactoring for months. The plan was tight. The diff was clean. The tests passed. I read it twice and killed it. The model had solved the problem I wrote down, not the problem we actually had.

That gap, the one between the prompt and the situation, is the whole game. Multiply it across a few hundred engineers, a multi-cloud footprint, and a growing portfolio of agentic workflows, and you have my Monday morning.

Dan Shipper recently published a long essay called After Automation. His thesis: AI progress creates more work for humans, not less. Every, the media and software company behind a daily newsletter on what comes next in technology, is a 30-person team running Codex, Claude Code, and Cowork across every function, and their headcount keeps climbing. The judgment load is heavier than it was before they automated anything.

I want to extend a few of his arguments from the vantage point of a Fortune 100 platform org. The picture changes when you scale from 30 people to a few hundred engineers operating across AWS, Azure, and a vendor ecosystem that will not slow down for anyone.

Two modes, and the one most leaders miss

Shipper splits agent work into two modes.

The first is the obvious one: agent employees. These are bots in Slack with names and jobs, or embedded agents like Fin handling basic support conversations.

The second is human-agent collaboration inside tools like Codex and Claude Code, where you and the agent share the same workspace, the same browser, the same files, and the same problem in real time.

The first mode is what most executives picture when they hear "AI strategy." The second mode is where much of the actual leverage lives, and most leaders are still underinvesting in it.

I run my whole day this way now. Spec, plan, review, refine. The agent does the typing. I do the framing.

By frame, I do not mean a better prompt. I mean the explicit shape of the work: constraints, invariants, acceptance criteria, risks, non-goals, and the situational judgment a senior engineer usually carries in their head.

Kieran Klaassen at Every calls this the human sandwich: a human sets the frame, the AI collapses the task, and a human judges and extends the result. The sandwich is not just a metaphor. It is a workflow shape.

The top slice is a written spec and a defined acceptance test. The middle is one or more agent runs against that spec. The bottom slice is a human review that either ships the work or feeds the gap back into the spec for the next run.
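For readers who think in code, here is a minimal sketch of that loop. It is an illustration under stated assumptions, not how Every or my teams actually implement it: the names `Spec`, `run_sandwich`, `toy_agent`, and `toy_review` are hypothetical, and the agent and reviewer are plain callables standing in for a real agent run and a real human review.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Spec:
    """Top slice: the written frame plus a defined acceptance test."""
    text: str
    acceptance_test: Callable[[str], bool]
    gaps: List[str] = field(default_factory=list)  # fed back by reviewers


def run_sandwich(
    spec: Spec,
    agent: Callable[[Spec], str],             # middle: one agent run
    review: Callable[[str], Optional[str]],   # bottom: None means "ship it"
    max_runs: int = 3,
) -> Optional[str]:
    """Return the shipped result, or None if no run survived review."""
    for _ in range(max_runs):
        result = agent(spec)
        if not spec.acceptance_test(result):
            spec.gaps.append("acceptance test failed")
            continue
        gap = review(result)
        if gap is None:
            return result        # the human ships the work
        spec.gaps.append(gap)    # the human feeds the gap back into the spec
    return None


# Toy stand-ins (assumptions) so the loop can be exercised end to end.
def toy_agent(spec: Spec) -> str:
    return f"draft-{len(spec.gaps) + 1}"


def toy_review(result: str) -> Optional[str]:
    return None if result == "draft-2" else "does not cover duplicate messages"


spec = Spec("Simplify the retry logic",
            acceptance_test=lambda r: r.startswith("draft-"))
shipped = run_sandwich(spec, toy_agent, toy_review)
```

The design choice worth noticing is that a rejected run changes the spec, not just the output. The gap becomes part of the frame for the next run.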

It is the same shape I describe in Spec, Standards, Specialists. It is also the shape my engineers are increasingly running every day.

If your org is mostly optimizing for "give every employee a chatbot," you are probably underinvesting in the mode that matters most.

Jevons, again

Shipper does not name Jevons Paradox, but much of his argument behaves like one.

Jevons Paradox: When a technology makes a resource cheaper or more efficient to use, total consumption of that resource often rises instead of falls.

AI makes yesterday's expert competence cheaper. Demand expands, and that expansion creates new demand for the next layer of judgment above it.

The pattern is familiar. Compilers did not eliminate programmers. Cloud did not eliminate operations. Kubernetes did not eliminate platform engineers. Each layer commoditized the layer below and created appetite for expertise at the layer above.

What does this look like at scale? When writing code becomes effectively free, enterprise output doesn't stay flat with fewer people. Instead, the organization's appetite for software explodes. We suddenly tackle the deep architectural debt, the niche automation pipelines, and the complex integration layers we used to defer.

What is different this time is the speed. The cycle that used to take a decade now runs in quarters. The benchmark you ran in January is saturated by July. The frame you designed in spring is the new floor by fall.

The job is no longer to master a stack and wait for the next abstraction wave. The job is to keep framing faster than the frame collapses. Models climb frames. Leaders build them.

The personal agent trap

This is one of the most important parts of Shipper's essay, and I suspect many leaders will skim past it. Every gave every employee a personal agent. Then they pulled it back.

Personal agents go stale fast because the humans they work with stop maintaining them. Every now runs agents at the team and org level, with a dedicated AI engineering function keeping them healthy.

This is not a minor implementation detail. It is the difference between an AI initiative that compounds and one that quietly dies in 18 months.

Agents are software. Software needs owners, telemetry, evals, on-call, and a backlog. Treat them like Lambda functions you forgot you deployed, and they will rot like Lambda functions you forgot you deployed.

The unit of ownership has to be a team or a platform, not an individual.

If an agent matters to the business, it needs a product owner. It needs a health model. It needs evaluation data. It needs a path for change. It needs someone who gets paged, literally or figuratively, when it starts making expensive nonsense with confidence.
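One way to make that checklist enforceable is to write it down as a record that tooling can inspect. The sketch below is hypothetical: `AgentRecord`, its field names, and `ownership_gaps` are my own illustration, not an existing registry format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AgentRecord:
    """Hypothetical registry entry for an agent that matters to the business."""
    name: str
    owning_team: str      # a team or platform, not an individual
    product_owner: str
    health_model: str     # how "healthy" is defined and measured
    eval_dataset: str     # where the evaluation data lives
    change_process: str   # the path for change, e.g. pull request review
    pager_rotation: str   # who gets paged, literally or figuratively


def ownership_gaps(agent: AgentRecord) -> List[str]:
    """Return the names of fields that were left blank."""
    return [f.name for f in fields(agent) if not getattr(agent, f.name).strip()]


# Example with made-up values: nobody is on the hook when this agent misbehaves.
orphan = AgentRecord(
    name="invoice-triage",
    owning_team="finance-platform",
    product_owner="finance-platform lead",
    health_model="weekly eval pass rate",
    eval_dataset="evals/invoice-triage/",
    change_process="pull request review",
    pager_rotation="",
)
gaps = ownership_gaps(orphan)
```

Here `gaps` comes back as `["pager_rotation"]`, which is exactly the conversation you want to have before the agent ships.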

I have been saying this for a year, and Every just gave us the receipts.

Smuggled intelligence and the Context Spine

Shipper introduces a concept he calls smuggled intelligence: the hidden layer of human judgment baked into every prompt that makes a benchmark result possible.

He dissects an OpenAI GDPval prompt that asks a model to perform audit sampling on financial crime metrics. The prompt is a wall of constraints: confidence levels, tolerable error rates, specific entities, risk weightings, and formatting requirements. By the time the model touches the problem, an expert has already done much of the hard work.

The benchmark measures the model. The score reflects the framer.

This is also why benchmark watching is a trap. Shipper calls it chart psychosis, the habit of extrapolating curves into sweeping predictions while ignoring that every benchmark lives inside a chosen frame. Saturate the frame. Move the frame. Watch the score collapse. Restart the cycle.

The operational answer is not to ignore benchmarks, which can be useful signals. The mistake is treating external benchmarks as your evaluation strategy. You need your own evals against your own work. You need to version them like code. You need to let them decay on a known schedule, because your work changes, your tools change, and your risks change.
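A rough sketch of what "versioned, with a known decay schedule" could look like in practice. Everything here is an assumption for illustration: the `EvalCase` fields, the 180-day default shelf life, and the case IDs are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval record, stored and versioned alongside the code."""
    case_id: str
    version: int
    last_reviewed: date
    shelf_life_days: int = 180  # assumed default; choose what fits your work

    def is_stale(self, today: date) -> bool:
        return today > self.last_reviewed + timedelta(days=self.shelf_life_days)


def stale_cases(cases: List[EvalCase], today: date) -> List[str]:
    """IDs of cases that are overdue for review or retirement."""
    return [case.case_id for case in cases if case.is_stale(today)]


cases = [
    EvalCase("retry-duplicate-messages", version=3, last_reviewed=date(2024, 1, 1)),
    EvalCase("retry-partial-outage", version=1, last_reviewed=date(2024, 6, 1)),
]
overdue = stale_cases(cases, today=date(2024, 7, 1))
```

With these dates, only the first case is overdue. The point is that staleness becomes a query you can run, not a feeling someone has during an incident review.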

I have been calling the artifact that carries this judgment a Context Spine, an idea I first laid out in The Agentic SDLC.

A Context Spine is a versioned Markdown specification that lives in the root of a repository, travels with the work through every agent run, and gets reviewed via pull requests like any other production artifact. It is the explicit vehicle for making smuggled intelligence reusable and legible.

To see the difference, look at how the same objective changes when the frame is made explicit.

A weak frame says:

Refactor this service to simplify the retry logic.

A strong frame, the kind that populates a Context Spine, says:

# Context Spine: Core Ingestion Service Retry Refactor

## Constraints & Invariants
- Maintain strict idempotency guarantees across all standard consumer endpoints.
- Do not increase downstream connection pressure during partial database outages.

## Acceptance Criteria
- Implement exponential backoff with jitter for transient 503 errors.
- Include explicit integration tests for duplicate message handling under high concurrency.

## Non-Goals
- Do not modify the underlying message schema or alter the dead-letter queue routing logic.

The first prompt asks for cleaner code. The second protects the system. That is the difference between asking an agent to finish a task and giving it enough context to help safely change a production system.
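To make the first acceptance criterion concrete, here is one common way to do exponential backoff with jitter, sometimes called "full jitter": each delay is drawn uniformly between zero and a capped exponential bound. This is a sketch, not the service's real code. `TransientError` is a hypothetical stand-in for a transient 503, and the base, cap, and attempt count are assumed values.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a transient 503 from a downstream call."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt)."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation: Callable[[], T], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    """Retry only on TransientError; re-raise once attempts are exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Note what the code cannot tell you: whether retrying is safe at all. That depends on the idempotency invariant in the spec, which is why the invariant sits above the acceptance criteria.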

The Context Spine lets a senior engineer's framing scale to a hundred agent runs without losing the thread. It also lets you audit, evaluate, and improve the frame itself, because the frame is now a first-class development artifact, not a Slack thread, a vague hallway conversation, or a comment buried in a pull request.
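Because the spine is plain Markdown, even a very small check can tell you whether a frame is missing a whole category of judgment. The sketch below assumes the three section names from the example above are required and that each needs at least one bullet. `REQUIRED_SECTIONS`, `parse_spine`, and `missing_sections` are hypothetical names, and the parser handles only `##` headings and `- ` bullets, not full Markdown.

```python
import re
from typing import Dict, List

# Assumed policy: these sections must exist and must not be empty.
REQUIRED_SECTIONS = ("Constraints & Invariants", "Acceptance Criteria", "Non-Goals")


def parse_spine(markdown: str) -> Dict[str, List[str]]:
    """Map each '## ' heading to the '- ' bullets listed beneath it."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.lstrip().startswith("- "):
            sections[current].append(line.lstrip()[2:].strip())
    return sections


def missing_sections(markdown: str) -> List[str]:
    """Required sections that are absent or have no bullets."""
    sections = parse_spine(markdown)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]


WEAK_FRAME = "Refactor this service to simplify the retry logic."

STRONG_FRAME = """\
# Context Spine: Core Ingestion Service Retry Refactor

## Constraints & Invariants
- Maintain strict idempotency guarantees across all standard consumer endpoints.
- Do not increase downstream connection pressure during partial database outages.

## Acceptance Criteria
- Implement exponential backoff with jitter for transient 503 errors.
- Include explicit integration tests for duplicate message handling under high concurrency.

## Non-Goals
- Do not modify the underlying message schema or alter the dead-letter queue routing logic.
"""

weak_gaps = missing_sections(WEAK_FRAME)
strong_gaps = missing_sections(STRONG_FRAME)
```

The weak frame fails on all three sections and the strong frame on none. A check like this says nothing about whether the constraints are the right ones. It only makes their absence impossible to miss.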

Spec-driven multi-agent development is not a workflow preference. It is the discipline of making your framing legible enough to compound.

Governance without ceremony

There is a danger here, especially in large enterprises. Leaders will hear "treat agents like software" and respond with a 47-page governance process, three approval boards, and a SharePoint site that somehow makes everyone sad.

That is not the point.

The point is lightweight ownership, visible risk, measurable quality, and fast feedback. How do we treat agents like production software without drowning in ceremony? By automating the guardrails.

Instead of manual review boards, we use automated evaluation harnesses embedded directly into our CI/CD pipelines. If a team modifies an agent or updates a Context Spine, the changes run against a deterministic suite of regression tests and simulation loops. The pipeline acts as the gatekeeper, verifying that the agent operates safely within its boundaries before it ever interacts with production logic.
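The gate itself can be small. Here is a minimal sketch, assuming each regression case is a prompt paired with a deterministic check on the agent's output. The names `Check`, `pass_rate`, and `gate_exit_code` are hypothetical, and a real harness would call a real agent where this uses a stub.

```python
from typing import Callable, List, Tuple

# (prompt, deterministic check applied to the agent's output)
Check = Tuple[str, Callable[[str], bool]]


def pass_rate(cases: List[Check], agent: Callable[[str], str]) -> float:
    """Fraction of cases whose check passes on the agent's output."""
    if not cases:
        return 0.0  # an empty suite proves nothing, so it must not pass the gate
    passed = sum(1 for prompt, check in cases if check(agent(prompt)))
    return passed / len(cases)


def gate_exit_code(cases: List[Check], agent: Callable[[str], str],
                   min_pass_rate: float = 1.0) -> int:
    """0 lets the pipeline proceed; 1 blocks the change."""
    return 0 if pass_rate(cases, agent) >= min_pass_rate else 1


# Stub agents and cases (assumptions) to show the gate opening and closing.
cases: List[Check] = [
    ("retry on 503", lambda out: "backoff" in out),
    ("duplicate message", lambda out: "idempotent" in out),
]


def good_agent(prompt: str) -> str:
    return "idempotent handler with backoff"


def regressed_agent(prompt: str) -> str:
    return "retry immediately"


good_code = gate_exit_code(cases, good_agent)
bad_code = gate_exit_code(cases, regressed_agent)
```

In a pipeline step, the script would end with `sys.exit(gate_exit_code(...))` so a non-zero code fails the build. The threshold is a policy decision, and I would keep it in the repository next to the spine so changes to it are reviewed too.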

Not every agent interaction needs a full Context Spine. Asking for a regex, a shell command, or a quick explanation does not require ceremony. Nobody needs an architecture review board for "what does this awk command do?" We have all suffered enough.

But the moment an agent touches production logic, customer experience, security posture, data movement, regulatory exposure, or architectural direction, the frame needs to become durable and verifiable.

The question is not, "Did we use AI?" The question is, "Can we explain and automatically verify the frame the AI operated inside?"

If the answer is no, you do not have an AI strategy. You have vibes with a diff.

What this means if you lead engineers

Retire the personal-agent fantasy. Stop promising every employee a custom AI sidekick as the centerpiece of your strategy. Encourage experimentation, but focus your energy on building durable agents that serve teams and systemic workflows. Fewer agents, better maintained, with clear organizational owners. This is the contrarian move most enterprises will get wrong, and it is the one that protects your investment.

Invest in framers, not finishers. The scarce skill is no longer typing the code. It is shaping the problem so the agent can finish it well. Promote, hire, and develop people who can sit at the front of the human sandwich. They are your primary leverage point.

Treat agents as platform infrastructure. Evals, simulation harnesses, prompt libraries, skill packages, and CI hooks belong in your platform operating model, not scattered across hobbyist teams. Stand up an AI engineering capability with clear ownership, support expectations, and a structured roadmap. Apply the same discipline you use for your internal developer platform to the agents now changing your systems.

Measure systemic leverage, not headcount. The old metric was engineers per team. The better metric is throughput per framer, with quality, production stability, and rework factored in. If your reporting still rewards large headcounts over high-leverage teams shipping safely at velocity, you will starve the operating model that actually works.

Build the Context Spine. Make your specs durable, versioned, and machine-readable. Treat the frame as a first-class development artifact. Your future self, and every agent run downstream of you, will thank you.

Coda

Shipper closes with a story about a man who writes down where he put his clothes so he can find them in the morning. The next day he finds his cap, his pants, and his shoes. Then he asks: where am I myself?

That is the question every leader has to keep asking as the models get better.

The models will find your cap. They will find your pants. They will find every part of your work that has become legible enough to write down. The part of you that decides what matters today, in this codebase, for this customer, in this market, is the part the frame depends on.

The bottleneck is not the stack. It is the frame.
