Last week, Claude Code handed me a credible-looking refactor for a service I had been reworking for months. The plan was tight. The diff was clean. The tests passed. I read it twice and killed it. The model had solved the problem I wrote down, not the problem we actually had.
That gap, the one between the prompt and the situation, is the whole game. Multiply it across a few hundred engineers, a multi-cloud footprint, and a growing portfolio of agentic workflows, and you have my Monday morning.
Dan Shipper recently published a long essay called After Automation. His thesis: AI progress creates more work for humans, not less. Every, the company he runs, is a 30-person team using Codex, Claude Code, and Cowork across every function, and its headcount keeps climbing. The judgment load is heavier than it was before they automated anything.
I want to extend a few of his arguments from the vantage point of a Fortune 100 platform org. The picture changes when you scale from 30 people to a few hundred engineers operating across AWS, Azure, and a vendor ecosystem that will not slow down for anyone.
Two modes, and the one most leaders miss
Shipper splits agent work into two modes.
The first is the obvious one: agent employees. These are bots in Slack with names and jobs, or embedded agents like Fin handling basic support conversations.
The second is human-agent collaboration inside tools like Codex and Claude Code, where you and the agent share the same workspace, the same browser, the same files, and the same problem in real time.
The first mode is what most executives picture when they hear "AI strategy." The second mode is where much of the actual leverage lives, and most leaders are still underinvesting in it.
I run my whole day this way now. Spec, plan, review, refine. The agent does the typing. I do the framing.
By frame, I do not mean a better prompt. I mean the explicit shape of the work: constraints, invariants, acceptance criteria, risks, non-goals, and the situational judgment a senior engineer usually carries in their head.
Kieran Klaassen at Every calls this the human sandwich: a human sets the frame, the AI collapses the task, and a human judges and extends the result. The sandwich is not just a metaphor. It is a workflow shape.
The top slice is a written spec and a defined acceptance test. The middle is one or more agent runs against that spec. The bottom slice is a human review that either ships the work or feeds the gap back into the spec for the next run.
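To make the shape concrete, here is a minimal sketch of that loop in Python. Everything in it is my own illustration, not something from Shipper or Klaassen: the names Frame and run_sandwich are hypothetical, the agent is any callable that takes a spec and returns a result, and the review callable stands in for a human reviewer who returns a list of gaps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Frame:
    """Top slice: a written spec plus a defined acceptance review (hypothetical shape)."""
    spec: str
    # Stand-in for the human review: returns a list of gaps, empty means "ship it".
    review: Callable[[str], List[str]]
    gaps_fed_back: List[str] = field(default_factory=list)


def run_sandwich(
    frame: Frame,
    agent: Callable[[str], str],
    max_runs: int = 3,
) -> Tuple[Optional[str], int]:
    """Run the agent against the spec until the review finds no gaps."""
    for attempt in range(1, max_runs + 1):
        result = agent(frame.spec)      # middle: one agent run against the spec
        gaps = frame.review(result)     # bottom slice: review the result
        if not gaps:
            return result, attempt      # ship the work
        # Otherwise feed the gaps back into the spec for the next run.
        frame.gaps_fed_back.extend(gaps)
        frame.spec += "\n" + "\n".join(f"- Must address: {gap}" for gap in gaps)
    return None, max_runs               # never passed review; do not ship


# Usage with toy stand-ins for the agent and the reviewer.
def toy_agent(spec: str) -> str:
    return "retry+dedupe" if "duplicate" in spec else "retry"


def toy_review(result: str) -> List[str]:
    return [] if "dedupe" in result else ["duplicate message handling"]


frame = Frame(spec="Simplify the retry logic", review=toy_review)
result, attempts = run_sandwich(frame, toy_agent)
```

The detail that matters is the feedback path: a failed review changes the spec, not just the next prompt, so the frame improves with every run.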
It is the same shape I describe in Spec, Standards, Specialists. It is also the shape my engineers are increasingly running every day.
If your org is mostly optimizing for "give every employee a chatbot," you are probably underinvesting in the mode that matters most.
Jevons, again
Shipper does not name Jevons Paradox, but much of his argument behaves like one.
Jevons Paradox: When a technology makes a resource cheaper or more efficient to use, total consumption of that resource often rises instead of falls.
AI makes yesterday's expert competence cheaper. Demand for that competence expands, and the expansion creates new demand for the next layer of judgment above it.
The pattern is familiar. Compilers did not eliminate programmers. Cloud did not eliminate operations. Kubernetes did not eliminate platform engineers. Each layer commoditized the layer below and created appetite for expertise at the layer above.
What is different this time is the speed. The cycle that used to take a decade now runs in quarters. The benchmark you ran in January is saturated by July. The frame you designed in spring is the new floor by fall.
The job is no longer to master a stack and wait for the next abstraction wave. The job is to keep framing faster than the frame collapses. Models climb frames. Leaders build them.
The personal agent trap
This is one of the most important parts of Shipper's essay, and I suspect many leaders will skim past it. Every gave every employee a personal agent. Then they pulled it back.
Personal agents go stale fast because the humans they work with stop maintaining them. Every now runs agents at the team and org level, with a dedicated AI engineering function keeping them healthy.
This is not a minor implementation detail. It is the difference between an AI initiative that compounds and one that quietly dies in 18 months.
Agents are software. Software needs owners, telemetry, evals, on-call, and a backlog. Treat them like Lambda functions you forgot you deployed, and they will rot like Lambda functions you forgot you deployed.
The unit of ownership has to be a team or a platform, not an individual.
If an agent matters to the business, it needs a product owner. It needs a health model. It needs evaluation data. It needs a path for change. It needs someone who gets paged, literally or figuratively, when it starts producing expensive nonsense with confidence.
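One way to keep that checklist honest is to make it a record you can lint. The sketch below is an assumption about how such a registry entry might look; AgentRecord, ownership_gaps, and the field names are hypothetical and not taken from Every's setup or any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Fields every business-critical agent should have filled in (illustrative list).
REQUIRED_FIELDS = (
    "product_owner",
    "health_model",
    "eval_dataset",
    "change_process",
    "on_call",
)


@dataclass
class AgentRecord:
    """Hypothetical registry entry for one team-level or org-level agent."""
    name: str
    product_owner: Optional[str] = None   # team or platform, not an individual
    health_model: Optional[str] = None    # e.g. a link to dashboards or SLOs
    eval_dataset: Optional[str] = None    # where the evaluation data lives
    change_process: Optional[str] = None  # how changes get proposed and shipped
    on_call: Optional[str] = None         # who gets paged when it misbehaves


def ownership_gaps(agent: AgentRecord) -> List[str]:
    """Return the required fields that are still empty for this agent."""
    return [name for name in REQUIRED_FIELDS if not getattr(agent, name)]


# Usage: an agent with an owner but nothing else is not yet production-ready.
support_bot = AgentRecord(name="support-triage", product_owner="platform-team")
missing = ownership_gaps(support_bot)
```

An empty result from ownership_gaps is the minimum bar I would set before calling an agent durable.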
I have been saying this for a year, and Every just gave us the receipts.
Smuggled intelligence and the Context Spine
Shipper introduces a concept he calls smuggled intelligence: the hidden layer of human judgment baked into every prompt that makes a benchmark result possible.
He dissects an OpenAI GDPval prompt that asks a model to perform audit sampling on financial crime metrics. The prompt is a wall of constraints: confidence levels, tolerable error rates, specific entities, risk weightings, and formatting requirements. By the time the model touches the problem, an expert has already done much of the hard work.
The benchmark measures the model. The score reflects the framer.
This is also why benchmark watching is a trap. Shipper calls it chart psychosis, the habit of extrapolating curves into sweeping predictions while ignoring that every benchmark lives inside a chosen frame.
Saturate the frame. Move the frame. Watch the score collapse. Restart the cycle.
The operational answer is not to ignore benchmarks. They can be useful signals. The mistake is treating external benchmarks as your evaluation strategy. You need your own evals against your own work. You need to version them like code. You need to expire them on a known schedule and then refresh or retire them, because your work changes, your tools change, and your risks change.
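As a rough sketch of what versioned evals with a known expiry could look like, assume each eval case carries a version, a last-reviewed date, and a shelf life. The names EvalCase and stale_evals, and the 90-day figure in the usage example, are my own illustrative choices.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class EvalCase:
    """One internal eval, versioned like code (hypothetical shape)."""
    name: str
    version: str
    last_reviewed: date
    shelf_life_days: int  # how long the eval is trusted before it needs review

    def expires_on(self) -> date:
        return self.last_reviewed + timedelta(days=self.shelf_life_days)


def stale_evals(cases: List[EvalCase], today: date) -> List[str]:
    """Names of evals whose shelf life has run out as of `today`."""
    return [case.name for case in cases if today > case.expires_on()]


# Usage: one eval reviewed in January, one reviewed in May, checked on June 1.
cases = [
    EvalCase("retry-idempotency", "1.2.0", date(2025, 1, 1), shelf_life_days=90),
    EvalCase("audit-sampling", "0.4.1", date(2025, 5, 1), shelf_life_days=90),
]
overdue = stale_evals(cases, today=date(2025, 6, 1))
```

A check like this can run in CI, so an expired eval shows up as a failing build rather than a surprise in a quarterly review.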
I have been calling the artifact that carries this judgment a Context Spine: a versioned markdown spec that lives in the repo, travels with the work through every agent run, and gets reviewed like any other production artifact.
It encodes the invariants, the acceptance criteria, the known traps, and the explicit non-goals. The spec is the smuggled intelligence, made explicit and reusable.
A weak frame says:
Refactor this service to simplify the retry logic.
A better frame says:
Refactor retry behavior without changing idempotency guarantees, without increasing downstream pressure during partial outages, and with explicit tests for duplicate message handling.
The first prompt asks for cleaner code. The second protects the system.
That is the difference between asking an agent to finish a task and giving it enough context to help safely change a production system.
The Context Spine lets a senior engineer's framing scale to a hundred agent runs without losing the thread. It also lets you audit, evaluate, and improve the frame itself, because the frame is now a first-class artifact, not a Slack thread, a vague hallway conversation, or a comment buried in a pull request.
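For illustration, here is one possible shape for a Context Spine, built from the retry example above. This is a sketch under my own assumptions: the ContextSpine class, its field names, and the markdown layout are hypothetical, and the known traps and non-goals are left empty because the example frame does not state any.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextSpine:
    """A versioned spec that renders to markdown and lives in the repo (illustrative)."""
    title: str
    version: str
    invariants: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    known_traps: List[str] = field(default_factory=list)
    non_goals: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the spine as a markdown document suitable for code review."""
        sections = [
            ("Invariants", self.invariants),
            ("Acceptance criteria", self.acceptance_criteria),
            ("Known traps", self.known_traps),
            ("Non-goals", self.non_goals),
        ]
        lines = [f"# {self.title} (v{self.version})"]
        for heading, items in sections:
            lines.append("")
            lines.append(f"## {heading}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Usage: the "better frame" for the retry refactor, made explicit.
spine = ContextSpine(
    title="Retry refactor",
    version="1.0",
    invariants=[
        "Idempotency guarantees do not change",
        "No increase in downstream pressure during partial outages",
    ],
    acceptance_criteria=["Explicit tests for duplicate message handling"],
)
markdown = spine.to_markdown()
```

Keeping the structure in code is optional. What matters is that the rendered file is committed, diffed, and reviewed alongside the change it governs.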
If smuggled intelligence is the problem Shipper names, the Context Spine is one answer to it.
Spec-driven multi-agent development is not a workflow preference. It is the discipline of making your framing legible enough to compound.
Governance without ceremony
There is a danger here, especially in large enterprises.
Leaders will hear "treat agents like software" and respond with a 47-page governance process, three approval boards, and a SharePoint site that somehow makes everyone sad.
That is not the point.
The point is lightweight ownership, visible risk, measurable quality, and fast feedback. The best agent operating models will look less like annual review boards and more like CI/CD: small checks, close to the work, running continuously.
Not every agent interaction needs a full Context Spine. Asking for a regex, a shell command, or a quick explanation does not require ceremony. Nobody needs an architecture review board for "what does this awk command do?" We have all suffered enough.
But the moment an agent touches production logic, customer experience, security posture, data movement, regulatory exposure, or architectural direction, the frame needs to become durable.
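A small check in the spirit of CI can enforce that line without a review board. The sketch below assumes each agent-driven change is tagged with the areas it touches and optionally points at a spec; AgentChange, check_frame, and the tag names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# The areas named above, expressed as tags (the tag spellings are my own).
HIGH_RISK_AREAS = frozenset({
    "production_logic",
    "customer_experience",
    "security_posture",
    "data_movement",
    "regulatory_exposure",
    "architectural_direction",
})


@dataclass
class AgentChange:
    """One agent-assisted change, as a CI step might see it (illustrative)."""
    description: str
    touches: Set[str] = field(default_factory=set)
    spec_path: Optional[str] = None  # path to the durable frame, if one exists


def check_frame(change: AgentChange) -> Optional[str]:
    """Return an error message if a risky change has no durable frame, else None."""
    risky = sorted(change.touches & HIGH_RISK_AREAS)
    if risky and not change.spec_path:
        return f"Durable frame required: change touches {', '.join(risky)}"
    return None


# Usage: a quick question passes; an unframed security change does not.
quick = check_frame(AgentChange("explain an awk command"))
blocked = check_frame(
    AgentChange("rotate token handling", touches={"security_posture", "data_movement"})
)
```

The check stays cheap on purpose: low-risk interactions pass untouched, and only changes in the listed areas are asked to point at a spec.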
The question is not, "Did we use AI?"
The question is, "Can we explain the frame the AI operated inside?"
If the answer is no, you do not have an AI strategy. You have vibes with a diff.
What this means if you lead engineers
Retire the personal-agent fantasy. Stop promising every employee a custom AI sidekick as the centerpiece of your strategy. Encourage experimentation, but build durable agents that serve teams and workflows. Fewer agents, better maintained, with clear owners. This is the contrarian move most enterprises will get wrong, and it is the one that protects your investment.
Invest in framers, not finishers. The scarce skill is no longer typing the code. It is shaping the problem so the agent can finish it well. Promote, hire, and develop people who can sit at the front of the human sandwich. They are the leverage point.
Treat agents as a platform. Evals, harnesses, prompt libraries, skill packages, and CI hooks belong in your platform operating model, not scattered across hobbyist teams. Stand up an AI engineering capability with clear ownership, support expectations, and a roadmap. Apply the same discipline you use for your internal developer platform to the agents now changing your systems.
Measure leverage, not headcount. The old metric was engineers per team. The better metric is throughput per framer, with quality and rework included. If your reporting still rewards big teams over small teams shipping more safely, you will starve the model that actually works.
Build the Context Spine. Make your specs durable, versioned, and machine-readable. Treat the frame as a first-class artifact. Your future self, and every agent run downstream of you, will thank you.
Coda
Shipper closes with a story about a man who writes down where he put his clothes so he can find them in the morning. The next day he finds his cap, his pants, and his shoes. Then he asks: where am I myself?
That is the question every leader has to keep asking as the models get better.
The models will find your cap. They will find your pants. They will find every part of your work that has become legible enough to write down. The part of you that decides what matters today, in this codebase, for this customer, in this market, is the part the frame depends on. The bottleneck is not the stack. It is the frame.
