Spec, Standards, Specialists: How I Actually Build with AI

The strategy posts say AI software development is a system. This is the practitioner layer underneath that idea.

I have written a lot lately about the strategy of AI-assisted development: the dark factory model for the mental picture, the Agentic SDLC for how the tool sprawl resolves into a pipeline, and Claude Code as a build system for how the daily tooling has to be configured.

This post is the working loop underneath those ideas.

It is the workflow I actually run on real features: write a spec, load the project standards, hand the brief to Claude Code, and let a coordinated set of specialist agents work inside the guardrails.

The important part is not that the AI writes code. The important part is that the system has enough context, standards, and review pressure to produce code I can trust.

That is the shift. AI-assisted development is no longer just a better chat window. It is becoming an engineering workflow.

That is the whole shape:

Layer	Artifact	Purpose	Failure mode if missing
Intent	`tasks/spec.md`	Defines what we are building, for whom, and how we will know it works	The system confidently builds the wrong thing
Context	`CLAUDE.md`	Tells the system where the work lives and how the project operates	The output looks like a stranger wandered in with a keyboard and confidence
Standards	`coding-standards.md`	Defines the non-negotiables for quality, security, accessibility, testing, and maintainability	The code compiles, passes light review, and quietly drifts
Execution	Specialist agents	Breaks work into focused planning, building, review, docs, and closing loops	One giant session tries to be good at everything and misses important details

A quick definition before we go further. I use "AI-assisted" to mean a developer typing prompts into a chat window and reviewing each suggestion. I use "agentic" to mean a system where the model plans work, dispatches parts of that work to other models, and operates inside guardrails I designed ahead of time. The difference is who holds the steering wheel between turns. In an AI-assisted setup, you do. In an agentic one, the system does more of the driving, and you set the rails.

I am writing for two readers at once. If you are an engineering leader trying to understand what an AI-native delivery model could look like at team scale, this workflow is one of the units of work you may need to industrialize. If you are a senior engineer who wants a working blueprint, this is the loop, the inputs, and the costs, without pretending the rough edges are gone.

The leadership read sits on top of the practitioner one because the leader's job here is not to invent a new operating model from scratch. It is to recognize the one some engineers are already running and scale it intentionally.

Three inputs. One workflow. A loop I keep tightening.

The workflow

A representative example

Before the mechanics, here is what this looks like in practice.

On a recent side project, I was working through what looked like a small settings feature. The first instinct was to treat it like a simple UI change: add a control, wire it to persistence, update the state, move on.

The spec review slowed me down in the best way.

Claude came back with questions I should have asked myself first:

What happens if the stored value is missing or malformed?
Should the setting sync across devices or remain local?
What is the default for a returning user?
Does the UI need to announce the change for assistive technologies?
What should the user see if the save fails?

The original version in my head had one acceptance criterion: the setting can be changed and saved. After the spec review, the feature looked different.

Before spec review	After spec review
Setting can be changed and saved	Default behavior is defined
Basic persistence	Malformed stored values are handled
Happy-path UI control	Accessibility announcements are included
Save action works	Save failure behavior is specified
State updates locally	State restoration is tested

None of those questions required genius. They required patience. That is exactly the kind of patience a good workflow can force.

The planner split the work into smaller tasks than I would have written on my own. The builder wrote the failing tests first. The reviewer caught an accessibility gap in the control behavior. Then /qcheck flagged a mismatch between the implementation and the acceptance criteria. None of the findings was dramatic. That was the point.

The workflow caught small problems while they were still cheap. It does not make the work effortless. It makes the work harder to lie about.

The three inputs

Every feature I build with AI starts with the same three artifacts. None of them is the code.

A specification. What we are building, why, for whom, with what constraints, and how we will know it works. The spec is the contract. If it is vague, everything downstream is vague.

Technical preferences. The platforms, languages, frameworks, and runtime targets this project lives inside. SwiftUI on a recent macOS. Bun and TypeScript with React. Java 21 with Spring Boot. AWS-native or on-device foundation models. The AI does not need to guess. It needs to know.

Coding and design standards. The non-negotiables. File size limits, type safety rules, accessibility baselines, security defaults, test discipline, naming, error handling. The body of work I packaged into a reusable Claude Skill so my standards travel with me from project to project.

These three inputs are the brief. Everything else is execution.

The order matters. Specs without standards produce plausible code that fails review. Standards without a spec produce well-structured code that solves the wrong problem. Preferences without either produce something that compiles, ships, and breaks in production three weeks later.

All three, together, are the minimum viable brief for an agentic workflow. Skip any one of them and the loop downstream gets noisier in proportion.

Step 1: write the spec with Claude, not for Claude

Here is the first place many people get this wrong. They open Claude, type a paragraph that starts with "build me a feature that," and call that a spec.

That is a prompt. A prompt is not a contract.

I use Claude as my drafting partner for the specification itself. I give it a rough description of the problem, the user, and the outcome I want. It comes back with a structured draft: goals, inputs and outputs, constraints, edge cases, out of scope, acceptance criteria, and open questions.

The open questions are the most valuable part. They are the holes in my own thinking, surfaced before I write a line of code.

Most of my first-draft specs come back with a healthy list of open questions. A handful are genuinely unresolved product decisions. A few are technical assumptions I had not validated. The rest are edge cases I had not considered.

I answer them, the spec firms up, and what looked like a one-week feature often turns into a clearer two-week feature with sharper acceptance criteria. That is a win, even though the timeline grows. A vague feature shipped fast is a bug factory in slow motion.

The shape of a working spec is boring on purpose:

Goal. One paragraph. What changes for the user.
Inputs and outputs. What the feature consumes and produces.
Constraints. Performance, privacy, platform, dependencies, and time.
Edge cases. What we expect to fail and how it should fail.
Out of scope. What we are deliberately not doing.
Acceptance criteria. Testable statements, one per behavior.
Open questions. Things we still owe ourselves an answer on.

Claude is good at producing this structure. I am good at deciding what belongs in each box. The collaboration is asymmetric, and that is the point. The model brings completeness. I bring judgment.

I keep the final spec in tasks/spec.md in the repo. Every later step in the workflow reads from it. The spec is not a one-time artifact, either. It evolves as I learn during the build. When I discover a new edge case, it goes into the spec, not just the code.

The spec is the source of truth. The code is what falls out of it.

This is the discipline that vibe coding skips. Vibe coding is fun. It produces something that looks like progress in the first hour. It can also produce something that looks like a tire fire in the third week.

The spec is the antibody.

Step 2: pair the spec with preferences and standards

Once the spec is real, the next step is to make sure the AI has the rest of the brief. This is where the workflow stops being "talk to Claude" and starts being "load the project's working memory."

Three files do the work.

tasks/spec.md holds the specification. The thing we just wrote.

CLAUDE.md holds the technical preferences and project conventions. Roughly 200 lines, project-specific, loaded into every session automatically. What stack, what versions, what file layout, what to never touch, and what local commands matter.

This is the file that tells the AI you use Bun, not npm. That migrations live under db/migrations and require a separate review path. That the design system lives in packages/ui and nothing else can hand-roll a button. That safeParse is preferred to parse on user input.

The boring details matter. They determine whether the output looks like it belongs in your codebase or like a stranger wandered in with a keyboard and confidence.

coding-standards.md holds the long-form standards. The same standards document I distilled into a reusable skill so my standards travel with me from project to project. File size limits, type safety, security defaults, accessibility baselines, testing discipline, and the red flags list.

The skill version makes this portable. The repo copy makes it concrete. They are not redundant. The skill is the canonical source. The repo copy is the version this project agreed to follow when the work started.

Together, those files answer three different questions:

The spec answers what we are building.
The preferences answer where it has to live.
The standards answer how it must be built.

This is the layer many teams underinvest in. They have specs of varying quality. They have stacks. They may even have standards scattered across wiki pages, pull request comments, and the collective memory of three senior engineers who everyone keeps interrupting on Slack.

That is not enough for an AI workflow.

The AI needs the standards in a form it can actually use. Otherwise, it defaults to a generic best-guess style that looks fine and fails review. The fix is not better prompting. The fix is doing the writing once so every session inherits it.

If I had to pick the single highest-leverage day of work for a team adopting this kind of workflow, it would be the day they sit down and write the standards file. Not the day they pick a model. Not the day they buy seats. The day they argue about the rules and write them down.

Everything else compounds on that.

Step 3: hand the brief to Claude Code, then let the specialists work

This is the part that surprises people the first time they see it.

I do not ask Claude Code to build the feature. I ask the main session to read the brief, plan the work, and dispatch specialists to do it.

In this workflow, a specialist is not a person. It is a narrowly scoped agent configuration with a role, a prompt, a tool boundary, and a definition of done.

The fan-out usually looks something like this, depending on the feature:

A planner session reads the spec, the preferences, and the standards. It produces a task breakdown, a dependency order, and a definition of done. The output goes to tasks/todo.md.
A builder subagent takes one task at a time. It writes the failing test first, then the minimum implementation, against the standards. Strict red, green, refactor.
A reviewer subagent runs after each meaningful change. For component work, it is the accessibility reviewer. For data work, it is the migration reviewer. For security-sensitive paths, it is a focused audit subagent. Read-only, narrow tools, explicit findings.
A docs subagent updates README files, ADRs, and inline docs as the implementation lands. Documentation as the work happens, not as a guilt trip at the end.
A closer session runs /qcheck, my skeptical staff-engineer review command, against everything that changed in the session. It compares the diff to the spec and the standards and flags anything that drifted.

Each specialist has one job, a tight tool allowlist, and its own context window. The main session does not have to hold every domain in its head, because the specialists do that work.

This is principle three of the build system view: specialists beat generalists. It is also the practical answer to the tool sprawl problem. The pipeline is not one giant agent. It is a coordinated multi-agent workflow with clear interfaces.

I sit in the loop, not above it. I review the planner's task list before the builder runs. I read the reviewer's findings before merging. I am not typing the code. I am steering the system that types the code.

That is the dark factory in practice. Lights-on at the inputs and outputs. Lights-off in the middle. My job is to design the line and check the parts coming off it, not to weld every joint by hand.

The failure mode this pattern helps solve is hidden quality drift.

The naive version of an agentic workflow is one large session that holds everything in its head: the spec, the standards, the migration rules, the accessibility rules, the security checklist, and the current task. That session can do pretty good work across ten domains. The output compiles. The tests pass. The code looks fine.

You do not notice the cost until the third regression, the missed accessibility issue, or the migration that locked the database.

A specialist does better work in one domain because its context window is not polluted by the other nine. Multi-agent workflows are not always faster than a single big session in any one moment. They are sharper across many moments.

That is the trade.

The guardrails matter more than the agents

The agent setup only works because the permissions are intentionally boring.

Reviewers are read-only. Builders get narrow tool access. Destructive operations require human approval. Secrets stay out of prompts. Hooks log the commands that matter. The workflow creates a trail of what changed, what ran, and what the system decided.

This is not bureaucracy. It is what lets the workflow move quickly without pretending the model deserves unlimited trust.

That matters even more at team scale. A personal workflow can survive some rough edges because the person running it knows where the sharp corners are. A team workflow needs those edges made visible.

The question is not, "How do we make the agents more powerful?" The better question is, "What permissions, context, and review pressure let the agents be useful without becoming reckless?"

That is a much healthier place to start.

Where I still require human judgment

The workflow delegates work. It does not delegate accountability.

There are specific points where I still want a human in the loop:

Before accepting the spec.
Before approving the planner's task list.
Before letting implementation proceed after the failing test.
Before accepting or overriding reviewer findings.
Before merging the pull request.
Before changing standards that other sessions or teams inherit.

Those checkpoints are not ceremony. They are the places where judgment matters most.

The model can propose. The system can execute. The specialists can review. I still own the decision.

That distinction matters. The more capable the tooling becomes, the easier it is to confuse fluency with trustworthiness. Polish is not correctness.

What this actually looks like end to end

A single working session usually starts with the same rhythm.

I open the project. CLAUDE.md and the standards skill load automatically. I open tasks/spec.md, refine the next slice of work with Claude in conversation, and commit the updated spec.

I run a planner prompt that reads the spec and the standards and produces a task list. I review it, push back on anything that feels off, and commit tasks/todo.md. The planning step is short, but it protects the build from drifting. A plan I disagreed with on Tuesday will produce code I disagree with on Thursday.

Then I run /tdd against the first task. The builder writes a failing test. I read the test, approve it, and the builder writes the minimum implementation. The accessibility reviewer fires automatically on changed components. I read its findings. The post-tool-use hook runs a fast type check. The audit hook logs the bash commands. The decision hook captures the diff to the daily log.

I move to the next task. Same loop. The docs subagent updates the README as features land. The reviewer subagents catch the things I would miss at midnight. When something bites me twice, the lesson goes into tasks/lessons.md so the next session inherits the scar tissue.

When the slice is done, I run /qcheck. It reviews every changed file against the standards, flags drift, and tells me what is not yet ready to ship. I fix the flagged items. I push. CI runs the heavy suite. The pull request opens with a generated description that pulls from the spec and the diff.

I published a small companion repo with the artifacts from this post at github.com/vscarpenter/multi-agent-workflow. It includes the spec, project memory, standards, subagents, commands, hooks, planner output, and one completed vertical slice so you can see how the scaffolding turns into code. The repo is not meant to be copied blindly. It is meant to make the workflow inspectable.

I did not type a for loop. I made a stack of decisions, all of them about intent, structure, and trade-offs. The mix has shifted over the last year. More design, less syntax. More review, less translation.

The total work is not smaller. The hours feel different at the end of the day, in ways I am still adjusting to.

I do not use the full loop for everything

This is important. I do not use the full multi-agent loop for every change.

A copy edit, a small refactor, or a low-risk internal tweak does not need a five-agent procession with tiny ceremonial hats. The workflow earns its keep when the work has ambiguity, risk, cross-cutting impact, or enough surface area for quality drift to hide.

The way I think about maturity looks roughly like this:

Level	Workflow	Best for
Level 1	Spec plus Claude Code session plus human review	Small, low-risk features
Level 2	Spec plus standards plus one reviewer agent	Medium-risk changes
Level 3	Planner plus builder plus reviewers plus closer	Complex or high-risk features
Level 4	Team-owned standards, versioned agents, measured outcomes	Scaled engineering organizations

That framing matters because the goal is not to make every change heavier. The goal is to match the workflow to the risk.

A mature engineering organization already does this with architecture review, security review, testing depth, and release controls. Agentic engineering needs the same judgment. Use the full loop when it earns its keep. Do not build a cathedral for a closet shelf.

What the loop costs

I will not pretend this is free. Five honest costs are worth naming.

Up-front investment. The first time you do this on a new project, you are writing the standards, the CLAUDE.md, the slash commands, and the subagents. That is real work. It pays back inside two or three features, but the first feature feels slower because you are building the factory and the product at the same time.

Leaders can get impatient with this trade. Senior engineers usually recognize it. The way through it is to call it out explicitly so nobody confuses ramp cost with permanent cost.

Discipline. The loop only works if you keep spec.md, todo.md, and lessons.md honest. If todo.md drifts out of sync with reality, the next session resumes from a lie. I have done this. It is recoverable, but the cost is real.

The fix is boring and effective: end every session with the state files matching the world.

Judgment load. You make fewer keystrokes and more decisions. That is a different kind of tired. By the end of a long session, my hands are fine and my brain is cooked.

If you go in expecting AI-assisted development to feel like a vacation, you will be disappointed. You may also become worse at it, because the hard part moved instead of disappearing.

False confidence. The output looks polished. The tests pass. The reviewer subagents say the diff is clean. It is easy to assume the code is correct and ship it without enough thought.

The discipline of /qcheck and human review exists because polish is not correctness. The loop is fast. That speed deserves a counterweight, not a victory lap.

Thin specs. This one deserves its own warning. The whole loop is built on the spec. If you skip Step 1 and go straight to Step 3, you will get something. It may look fine. It will often be wrong in ways you only discover when a user does something the spec would have caught.

Every time I have been burned by this workflow, the root cause has been a thin spec. Always.

The wins are bigger than the costs, but only if you are honest about both.

What I am still working out

A short section, in the spirit of saying the loud part out loud.

I have not figured out how to share a multi-agent workflow across a team without flattening the parts that make it work. The standards layer scales. The subagents scale. The slash commands scale.

The judgment that decides when to override the planner, when to ignore a reviewer flag, and when to throw out the day's work and rewrite the spec does not scale by copying files. It scales through mentorship, practice, and shared review.

I am still working out what that mentorship looks like in a world where the keyboard time is increasingly delegated. The half-formed thought I am testing: the apprenticeship model has to move from pairing on code to pairing on specs and reviews. Senior engineers narrate why they reject a planner's task list, not why they wrote a particular function.

I also have not solved the cost problem at scale. A multi-agent workflow uses more tokens than a single session. For a senior engineer running this loop on high-leverage work, the math can be easy. Across a large engineering organization, the math becomes a real conversation, and one worth having with finance partners before the usage curve surprises anyone.

Where I am leaning: tier the workflow. Full multi-agent loops for complex or high-risk features. Lighter single-session work for the routine path. The org-wide bill is a portfolio, not a flat rate.

There is also a measurement problem. The most interesting gains are not always visible in raw throughput. Some show up as better specs. Some show up as fewer review cycles. Some show up as defects that never escape because the workflow caught them early. Those are real outcomes, but they require better instrumentation than "how many lines of code did the AI write?"

The metric I am watching most closely right now is rework rate, the share of pull requests that require substantive changes after initial implementation. It is the cleanest proxy I have found for whether the spec was real and the specialists did their job.

If you have figured out how to scale the judgment layer, the cost model, or the measurement model, I would like to hear about it.

I am still learning this part.

What this means for engineering leaders

This is not just a personal productivity setup. One credible shape of an AI-native software organization looks a lot like this workflow scaled to a team.

The spec layer is product engineering, formalized. It forces the conversation about what we are building before we start arguing about how. It is the same discipline we have always wanted from product specs, now reinforced by the fact that an AI cannot proceed reliably without one.

Teams that have struggled to maintain spec discipline in the past will discover, sometimes painfully, that AI exposes the gap immediately. That is a feature, not a bug. The bottleneck moves from typing to thinking, which is exactly where it should be.

The preferences and standards layer is platform engineering. Shared CLAUDE.md baselines. A team-owned standards skill. A common set of subagents for review work. Permission baselines that ship with the platform, not assembled by every developer at their kitchen table.

This is the same pipeline-first culture shift work, applied to AI tooling instead of CI/CD. The same uncomfortable conversations apply. Centralized defaults feel slower in the moment and faster across the year.

Standardization is not the enemy of senior engineers. Done well, it frees them to do the work only senior engineers can do.

The orchestration layer is delivery engineering. Multi-agent workflows, each role with a job and an interface, look a lot like any well-designed delivery system. The Agentic SDLC post was about why the boundaries matter. This post is about what it feels like when you actually draw them.

At team scale, those boundaries need owners. Subagents need versioning, review, and a path to retirement. Standards need a steward. Specs need a template the team agrees on. None of that is glamorous work. All of it determines whether the system gets better every week or rots into shadow tooling.

The leader's job in this world is not to tell people which prompts to use. It is to make sure the spec layer is real, the standards layer is owned, and the orchestration layer has paved paths.

The teams that learn how to do that early will have an advantage over teams still treating AI coding as a personal productivity hack.

A note on what "senior" means now

The bar for senior engineering is changing.

Writing correct code quickly still matters. It will always matter. But speed at the keyboard is becoming less differentiating by itself because the model can now do more of that translation work.

The differentiating skills are moving upstream and downstream: writing a clear spec, holding the line on standards, designing the boundaries between agents, reading a diff with skepticism, knowing what to test, knowing what to mock, and knowing when the model is confidently solving the wrong problem.

Those skills were always valuable. AI just makes them more visible.

If you are a leader thinking about hiring, promotion criteria, or career ladders, this is the conversation worth having now. Not because every engineer needs to become an AI whisperer. Because every senior engineer will need to become better at defining intent, constraining systems, and reviewing work produced by something other than a human teammate.

That is a real shift. It is also a healthy one.

What changes Monday

If you want to start running this loop, you do not need to rebuild your tooling. Start with five things, in order.

Pick one feature on one project. Small enough to ship in a week. Real enough to matter.
Write the spec with Claude, not for Claude. Use the seven sections above. Answer the open questions before you write code.
Make sure your CLAUDE.md and your standards file are real. If they are not, today is the day. Two hours of writing can pay back for months.
Run the build with one specialist in the loop. A reviewer is the easiest place to start. Read-only. Narrow tools. Explicit findings.
Close the loop. Run a skeptical review pass against the spec and the standards before you push. Even one prompt works. /qcheck is just a saved version of that discipline.

That is the minimum viable version.

From there, add planners, docs subagents, and the rest as the absence of each one starts to hurt. The way you know you need a new specialist is simple: it is the moment you catch yourself wishing for one. Build it then. Not before.

If you are a leader and you want this to scale beyond a few enthusiasts, the equivalent five things look different.

Identify the one team most likely to make this work. Senior, motivated, willing to share what they learn. Resource them to pilot.
Make the standards layer a platform deliverable. Owned, versioned, governed. Not a wiki page with good intentions and an expiration date.
Define what a spec looks like for your organization. Template it. Make it the price of admission for any AI-assisted feature.
Build a small library of reusable specialists. Reviewers first. They have the highest signal-to-noise.
Measure something honest. Pick one: lead time on AI-assisted features versus comparable ones, defect rate on AI-assisted code, reviewer flag rate per pull request, or rework after initial implementation. Watch the trend for two quarters. Decide based on data, not vibes.

The loop is not magic. It is an operating model that makes good engineering discipline easier to repeat.

Write the spec. Set the standards. Let the specialists work. Read the output with the same skepticism you would bring to a pull request from your sharpest skeptical colleague.

Ship when the system earns your trust. The bottleneck is no longer the typing. It is the thinking. That is good news for engineers who like to think, and a fair warning for the ones who built a career on speed alone. Take the deal.