Claude Code Is a Build System, Not a Chatbot
After thirteen months of daily Claude Code use, I stopped treating AI coding as a prompt discipline problem and started treating it like an engineering system: configurable, layered, observable, and built to learn.
A year of refinement, six principles, and the configuration to back them up.
Repo: github.com/vscarpenter/claude-code-build-system. Fork it, take what is useful, skip the rest.
Most Claude Code setups start the same way mine did. You treat it like a very smart chatbot you have to babysit. You write detailed prompts. You paste in style guides. You ask nicely for tests. It works when you are paying close attention. It fails silently when you are not.
The setup I use now treats Claude Code more like a build system: configurable, layered, observable, and designed to keep working when my attention drifts. That shift took thirteen months, a lot of failed experiments, and enough humbling moments to keep the learning honest.
I do not mean Claude Code replaces CI, Make, Gradle, Xcode, or your deployment pipeline. I mean it deserves the same operating model we give build systems: explicit rules, repeatable workflows, guardrails, logs, fast feedback, and a bias toward automation over memory.
This post covers the configuration I have converged on through daily use on projects spanning JavaScript, TypeScript, Java, Swift, and Python, plus a stack of utility projects. Six principles, the actual code that implements them, what I got wrong, and a starter kit if you want to skip some of the fumbling.
The shift: discipline does not scale
For the first few months, I did what most people do. I wrote detailed prompts. I pasted in style guides. I asked Claude to run tests, follow conventions, and avoid known traps. It worked when I was fully engaged. It failed the moment I got distracted.
The shift came when I stopped writing instructions and started writing enforcement. Instead of “please run the tests after editing,” I added a hook that runs a fast type check automatically. Instead of “remember to follow the accessibility checklist,” I built a subagent that reviews changed .tsx files against WCAG AA. Instead of “use safeParse rather than parse on user input,” I documented the lesson in tasks/lessons.md and reinforced it across sessions.
The principle is simple: anything you have to remember to do, you will eventually forget. Anything you can put in a hook, a skill, a subagent, a slash command, or a denylist will keep working when you stop paying attention.
Everything else in this post is a corollary.
Principle 1: write the standards once, reference them everywhere
I have a 753-line coding-standards.md checked into the repo. It has ten parts covering agentic behavior, code quality, testing, security, git workflow, ADRs, task management, prompt engineering, reusable skills, and a quick-reference list of red flags.
That sounds excessive. It is not. The cost of writing it once amortizes across every session, every pull request, and every new contributor, human or AI. The benefit is that when I tell Claude to “follow the standards,” it actually has standards to follow. Not “be careful,” but explicit, prescriptive rules.
The difference is structural. It is the gap between asking a junior engineer to “write good code” and handing them a clear style guide.
Two files do the work:
CLAUDE.md loads automatically every session. It includes project conventions, gotchas, file locations, and what not to do. I keep it tight, roughly 200 lines, because it is always in context.
coding-standards.md holds the long-form rules. Skills, subagents, and slash commands reference it when they need deeper guidance.
Each rule includes the reason behind the rule, not just the rule itself. That matters because future-me still needs to judge edge cases the document did not anticipate. “Use parameterized queries” without “to prevent SQL injection because of this attack pattern” produces a parrot. Include the why and you produce a thinker.
I treat updates to these files as engineering work, not as edits to a notepad.
Principle 2: make the right thing automatic, even when it costs you
I run three custom hooks, all in shell scripts under ~/.claude/hooks/. None of them is doing rocket science. The value is that they fire deterministically on relevant lifecycle events, with no prompting from me.
audit-command.sh fires before every Bash tool call. It logs the command with timestamp and session ID. It pattern-matches against high-risk shapes such as curl | sh, eval, exec, and git push --force, then routes risky commands to a separate flagged- log. It is an audit trail, not a blocker. I review the flagged file weekly.
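A minimal sketch of the core logic, assuming a wrapper that reads the PreToolUse JSON payload from stdin and extracts the command string. The log paths and the risky-pattern list here are illustrative stand-ins, not the repo's exact version:

```shell
# Illustrative core of audit-command.sh. Log paths and patterns are
# stand-ins; the real hook reads the tool payload from stdin first.
AUDIT_LOG="${AUDIT_LOG:-$HOME/.claude/logs/commands.log}"
FLAG_LOG="${FLAG_LOG:-$HOME/.claude/logs/flagged-commands.log}"

is_risky() {
  # Pattern-match against high-risk command shapes.
  case "$1" in
    *curl*"|"*sh*|*"eval "*|*"exec "*|*"git push"*"--force"*) return 0 ;;
    *) return 1 ;;
  esac
}

audit_command() {
  local cmd="$1" ts entry
  ts=$(date '+%Y-%m-%d %H:%M:%S')
  entry="$ts [${CLAUDE_SESSION_ID:-unknown}] $cmd"
  mkdir -p "$(dirname "$AUDIT_LOG")" "$(dirname "$FLAG_LOG")"
  printf '%s\n' "$entry" >> "$AUDIT_LOG"
  # Audit trail, not a blocker: risky commands are routed, never denied.
  if is_risky "$cmd"; then
    printf '%s\n' "$entry" >> "$FLAG_LOG"
  fi
}
```

The whole point is that this runs on every Bash tool call with no judgment required in the moment; judgment happens later, during the weekly review of the flagged log.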
capture-decision.sh fires after file mutations. It captures the file path, session ID, and a git diff --stat snapshot to a daily decision log. When I find a regression two weeks later, I can scroll back through ~/.claude/decisions/decisions-2026-04-25.log and find the likely cause.
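The shape of that hook, sketched under the same assumption that a wrapper pulls the edited file path out of the PostToolUse payload. `DECISIONS_DIR` and the entry format are illustrative:

```shell
# Illustrative core of capture-decision.sh. DECISIONS_DIR stands in
# for ~/.claude/decisions; the real hook parses the tool payload first.
DECISIONS_DIR="${DECISIONS_DIR:-$HOME/.claude/decisions}"

log_decision() {
  local file="$1" log
  log="$DECISIONS_DIR/decisions-$(date '+%Y-%m-%d').log"
  mkdir -p "$DECISIONS_DIR"
  {
    printf '%s [%s] edited %s\n' "$(date '+%H:%M:%S')" \
      "${CLAUDE_SESSION_ID:-unknown}" "$file"
    # One-line change summary; silence the error outside a git repo.
    git diff --stat -- "$file" 2>/dev/null | tail -1
  } >> "$log"
}
```

One daily file per day keeps the logs scannable: a regression hunt becomes a grep over two or three files, not a scroll through one giant one.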
persist-memory.sh is the one I would write first if I were starting over. When a session ends, it reads the session transcript, pipes it into a fresh claude --print invocation, asks the model to extract one to three reusable learnings, and appends them to ~/.claude/lessons.md. Claude summarizes Claude. Over weeks, that file becomes an institutional memory of patterns and gotchas I would otherwise forget.
Here is the illustrative shape of the script:
#!/bin/bash
# ~/.claude/hooks/persist-memory.sh
set -euo pipefail
MEMORY_FILE="${CLAUDE_MEMORY_FILE:-$HOME/.claude/lessons.md}"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M')
INPUT=$(cat)
TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty')
[ -z "$TRANSCRIPT_PATH" ] && exit 0
[ ! -r "$TRANSCRIPT_PATH" ] && exit 0
# Cap transcript size. JSONL lines can be tens of KB each, and a long
# session will blow past the prompt limit. Recent context is what matters.
TRANSCRIPT_TAIL=$(tail -c 65536 "$TRANSCRIPT_PATH")
[ -z "$TRANSCRIPT_TAIL" ] && exit 0
# Bound the call so a stuck CLI invocation cannot hang the hook.
set +e
LEARNINGS=$(printf '%s' "$TRANSCRIPT_TAIL" | timeout 120 claude --print \
  "Review this Claude Code session transcript (JSONL). Extract 1-3
  specific, reusable learnings. Format each as a single bullet on its
  own line: '- [YYYY-MM-DD] <concise learning>'. Output ONLY the
  bullets, nothing else. If there are no genuine insights, output
  nothing at all.")
RC=$?
set -e
# On non-zero exit, claude writes the error to stdout. Do not append it
# to the memory file or you will persist things like "Prompt is too long".
[ "$RC" -ne 0 ] && exit 0
# Keep only well-formed bullet lines.
LEARNINGS=$(printf '%s\n' "$LEARNINGS" | grep -E '^- \[' || true)
if [ -n "$LEARNINGS" ]; then
  {
    echo ""
    echo "<!-- Session: $TIMESTAMP -->"
    echo "$LEARNINGS"
  } >> "$MEMORY_FILE"
fi
The version in the repo adds debug logging and a TTY guard for safer manual testing. Either way, the pattern is general: when you want the system to learn from itself, pipe durable state into a fresh claude --print call and append the output somewhere the next session can see it.
The cost is real. Hooks add latency. Nested Claude calls add usage cost. Hooks can also fail silently if you break them. I wrote the audit hook in March, broke it accidentally in April when I refactored the JSON parsing, and did not notice for nine days. That is the tax.
You also need to choose the right lifecycle event. A fast type check works well as a Stop hook because it runs when Claude finishes responding. Memory persistence belongs closer to a SessionEnd hook because it needs the full transcript. If you run a nested Claude call from a session-end hook, set an appropriate timeout. Otherwise, the clever automation you lovingly crafted may quietly tap out like a marathon runner in dress shoes.
For project-level enforcement, I use lighter hooks per repo. One project has a PreToolUse hook that blocks edits to .env*, lockfiles, and migration files. A PostToolUse hook runs bunx eslint --fix on edited TypeScript and JavaScript files. A Stop hook runs bun typecheck as a fast last-line check.
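Wired into a project's .claude/settings.json, those three hooks look roughly like this. The script names are mine, invented for illustration; the blocking behavior lives in the scripts themselves, since in Claude Code's hook model a PreToolUse hook that exits with code 2 blocks the tool call and feeds its stderr back to Claude:

```json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{ "type": "command", "command": ".claude/hooks/protect-files.sh" }]
    }],
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{ "type": "command", "command": ".claude/hooks/eslint-fix.sh" }]
    }],
    "Stop": [{
      "matcher": "",
      "hooks": [{ "type": "command", "command": "bun typecheck 2>&1 | tail -10" }]
    }]
  }
}
```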
The pattern is simple: project hooks for project-specific safety, global hooks for cross-project intelligence.
Principle 3: specialists beat generalists
The main Claude session is a generalist. It is good at a lot of things. It is not deeply expert at any one thing. When a domain has hard, non-obvious constraints, I push the work to a specialist subagent.
Two examples carry most of the weight.
a11y-reviewer is a read-only subagent that reviews changed .tsx files against the WCAG AA baseline from my standards doc. It checks semantic HTML, keyboard accessibility, drag-and-drop behavior, Radix dialogs, form labels, image alt text, color-only state, and modal focus. It returns findings as file:line, issue, fix. It does not modify code. The main session decides whether to act.
pg-migration-reviewer reviews changes under db/migrations/** against documented PostgreSQL gotchas. CREATE INDEX without CONCURRENTLY blocks writes for the duration of the build. Adding a NOT NULL column without a default fails immediately on any non-empty table. Foreign keys often need NOT VALID followed by a separate VALIDATE step to reduce lock risk. ALTER COLUMN TYPE for incompatible types can rewrite the entire table.
That is exactly the kind of constraint set you should not expect a human reviewer to remember at 11 p.m. on a Friday.
The subagent format is approachable. Here is the frontmatter for a11y-reviewer.md:
---
name: a11y-reviewer
description: Reviews changed React components against the WCAG AA
  baseline defined in coding-standards.md Part 2. Read-only.
model: sonnet
tools: Read, Grep, Glob, Bash
---
You are a strict accessibility reviewer for this codebase.
The baseline is non-negotiable...
Below the frontmatter is the system prompt, about 40 lines of explicit rules. The body of the agent is just a focused Claude session with one job and a tight tool allowlist.
The win is not that the subagent is magically smarter. The win is focus. Its context window does not get polluted by broader feature work. Its prompt is precise. Its findings are concise and actionable.
I run these subagents on Sonnet because Sonnet is fast and good enough for pattern-matching review work. I save Opus for the main session when I need deeper architectural judgment.
The payoff has been concrete. The pg-migration-reviewer caught a regression in February where I added an index without CONCURRENTLY in a migration meant to run against a populated table. It would have blocked writes during the build. The subagent flagged it before I committed. I have not shipped a migration that locked the database since.
Principle 4: rituals deserve commands
Some moments in development have a fixed shape every time. Starting a feature. Writing the failing test. Reviewing your own work before a pull request. Each is a ritual with predictable inputs, predictable structure, and predictable failure modes if you skip a step.
I encode those rituals as project-level commands and skills. Claude Code now treats older .claude/commands/ files and newer skills as part of the same slash-command surface. My examples still use commands because they are simple and portable. For richer workflows with supporting files, versioned instructions, and more reusable behavior, I would move them into skills.
Three commands carry most of the weight.
/qspec <feature> generates a Spec-Driven Development spec to tasks/spec.md. It includes Goal, Inputs and Outputs, Constraints, Edge Cases, Out of Scope, Acceptance Criteria, and empty test stubs that map to each acceptance criterion. The empty stubs are the magic. They make the spec executable, not just descriptive. No implementation code is written yet.
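To make that concrete, here is a skeleton of what a generated tasks/spec.md might look like. The feature, criteria, and stub names are invented for illustration; the section headings are the ones the command emits:

```markdown
# Spec: CSV import (hypothetical feature)

## Goal
Let users import tasks from a CSV file.

## Inputs and Outputs
Input: a user-supplied .csv file. Output: created task records plus an import summary.

## Constraints / Edge Cases / Out of Scope
(filled in per feature)

## Acceptance Criteria
1. Rejects files larger than 10 MB with a visible error.
2. Skips duplicate rows and reports the count.

## Test Stubs
- it("rejects files larger than 10 MB")        <!-- maps to AC 1 -->
- it("skips duplicate rows and reports count") <!-- maps to AC 2 -->
```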
/tdd <behavior> starts a strict red, green, refactor cycle. It writes a failing test first, pauses for my approval, then writes the minimum implementation. If the change touches db/migrations/**, it invokes pg-migration-reviewer. If it touches components, it invokes a11y-reviewer. The TDD discipline is enforced by the command itself. I cannot accidentally skip the red step unless I decide to bypass the ritual.
/qcheck runs a skeptical staff-engineer review of every changed file. It runs git status and git diff first to understand the work. It checks file size limits, type safety, test quality, project-specific gotchas such as safeParse on user input, bun run test instead of bun test, toast.error() instead of alert(), and the full Definition of Done.
The command file structure is intentionally simple:
---
description: Skeptical staff-engineer code review of changed files
  against coding-standards.md.
---
Review all changed files in this session as a skeptical staff
engineer. Apply the full coding-standards.md and CLAUDE.md rules.
Run `git status` and `git diff` first to see what has changed.
Then evaluate each changed file against:
1. Standards compliance (files <=350 lines, functions <=30 lines...)
2. Type safety (all signatures typed, no `any`...)
[...]
The filename becomes the command. The frontmatter sets the description. The body is the prompt template.
The flow is /qspec to think, /tdd to build, and /qcheck to ship. Three commands, one feature, less improvisation.
Principle 5: memory is a feature, not a side effect
Three layers of memory work together.
tasks/lessons.md holds project-specific gotchas. I review it at the start of every session. When something bites me twice, it goes here. Examples from a recent project include: “auto-generated columns cannot be referenced in custom indexes,” “the import parser strips trailing whitespace before validation,” and “use bun run test, not bun test.” Each entry is one or two lines. Each entry has saved me at least an hour of debugging.
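The format is deliberately flat: one dated bullet per lesson, nothing else, which also matches the bullet shape the persist-memory hook emits. The dates here are invented:

```markdown
<!-- tasks/lessons.md -->
- [2026-02-11] Auto-generated columns cannot be referenced in custom indexes.
- [2026-02-19] The import parser strips trailing whitespace before validation.
- [2026-03-02] Use `bun run test`, not `bun test`.
```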
tasks/todo.md holds in-flight work. It has a “Resuming From Here” section so a fresh session can pick up without me re-explaining context. The discipline is simple: never end a session with this file out of sync with reality.
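A sketch of the shape, with invented task names and file paths:

```markdown
<!-- tasks/todo.md -->
## Resuming From Here
- /tdd cycle for the CSV importer is mid-green: failing test written,
  implementation half done in src/import/csv.ts.

## Up Next
- [ ] Wire pg-migration-reviewer into the review step
- [ ] Prune settings.local.json
```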
Auto-memory under ~/.claude/projects/<sanitized-cwd>/memory/ is Claude Code’s own memory system. I store user-level facts such as “uses bun, not npm,” feedback such as “TDD is the default workflow,” and project-level decisions such as “merge freeze begins 2026-03-05” as discrete .md files indexed by MEMORY.md. The persist-memory.sh hook feeds new learnings into that broader memory pattern automatically.
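The layout I use looks roughly like this; the individual file names are my own convention, while MEMORY.md is the index file, and the example contents are the facts mentioned above:

```
~/.claude/projects/<sanitized-cwd>/memory/
├── MEMORY.md              # index of the files below
├── user-facts.md          # "uses bun, not npm"
├── workflow-feedback.md   # "TDD is the default workflow"
└── project-decisions.md   # "merge freeze begins 2026-03-05"
```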
Each layer answers a different question.
lessons.md answers: what gotchas exist in this project?
todo.md answers: what was I in the middle of?
Auto-memory answers: what facts about me, this project, and our agreements should persist across sessions?
The cost of writing things down is lower than the cost of remembering them. That is true for humans. It is true for AI agents. It is especially true at the seam between the two.
Principle 6: permissions are safety equipment, not friction
I treat the permission allowlist the way I treat a seatbelt. I want it on. I want it tight. I want it to catch me when I make a mistake.
My global ~/.claude/settings.json denies rm, sudo, chmod, and reads of .env* and **/secrets/**. Those are categories I never want Claude to touch without explicit confirmation. Below the denylist, I allow common development tools such as git:*, gh:*, package managers, language toolchains, and container CLIs.
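An abridged sketch of that file, using Claude Code's Tool(specifier) rule syntax. The entries shown are a representative subset of the categories above, not my full list:

```json
{
  "permissions": {
    "deny": [
      "Bash(rm:*)",
      "Bash(sudo:*)",
      "Bash(chmod:*)",
      "Read(.env*)",
      "Read(**/secrets/**)"
    ],
    "allow": [
      "Bash(git:*)",
      "Bash(gh:*)",
      "Bash(bun:*)",
      "Bash(docker:*)"
    ]
  }
}
```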
The goal is balance. Broad enough that I am not approving every command. Narrow enough that nothing destructive is one keystroke away.
Per project, I keep a settings.local.json, which is gitignored, for personal allowances and a settings.json, which is checked in, for the team baseline. The split matters. Anything in the team file ships to every contributor. Anything local is just mine.
I prune the local file every quarter. Left unattended, it accumulates parser-artifact junk, one-shot literal commands, and stale entries from cross-project sessions until it becomes 250 lines of noise wearing a tiny permissions hat.
The audit log from audit-command.sh is the third leg. Even when a command passes the allowlist, the log captures it. I scan the flagged- files weekly.
Twice in the last six months, that scan caught a pattern I would have missed otherwise. One was a curl <url> | sh Claude proposed for a one-off install. The pattern matched. The flag fired. I read it the next morning and replaced the install with a brew formula. Without the audit hook, I probably would have approved it in the moment and forgotten.
What I tried that did not work
Four antipatterns burned time, so you do not have to repeat them.
A heavyweight Stop hook. My first version ran the full Vitest suite at every conversation end. Thirty seconds, every time, on every project. I told myself it was the right safety net. It was actually a tax on iteration speed. CI catches the same things. I replaced it with bun typecheck, which runs in about three seconds, and I run the full suite manually before pushing. Lesson: hooks must be cheap or they become friction you eventually disable.
Mocking the database in tests. I had a test suite where every Dexie call was mocked. The tests passed. The first real migration failed in production because the mocks did not model index behavior. A user lost local task data when the schema upgrade silently corrupted the indexed lookup. I switched to fake-indexeddb, a real IndexedDB implementation in memory. Slower tests, fewer false positives. Lesson: prefer fakes over mocks when the boundary is structural.
Letting settings.local.json grow unbounded. Every “yes, allow this” click added a line. After a year, I had 256 entries. Half were stale: pnpm commands from a different project, Ruby and CocoaPods leftovers, and parser-artifact junk like lone done and fi lines from heredocs. I ran the fewer-permission-prompts skill, pruned the file to 101 entries, and added a quarterly calendar reminder to do it again. Lesson: anything that grows monotonically needs an explicit shrink ritual.
Trusting the AI to remember the gotchas. Early on, I documented a gotcha in a code comment. The comment got refactored away. The gotcha bit me again two weeks later. The fix was to move durable lessons into tasks/lessons.md, which gets loaded into context at the start of every session. Comments are for explaining why nearby code exists. lessons.md is for explaining what to never do again.
The leadership lesson hiding inside the dotfiles
This started as a solo-developer setup, but the pattern is bigger than personal productivity. The enterprise version of this is not “everyone copy my dotfiles.” That would be a terrible operating model, and an even worse onboarding experience.
At team scale, personal hooks become shared policy. Local preferences become governed defaults. Slash commands become team workflows. Subagents need ownership, versioning, review, and a path to retirement. Standards need stewardship. Permissions need a baseline. Logs need a purpose beyond “future-me may want this someday.”
In other words, this becomes platform work.
That is the leadership lesson for me. The most useful AI engineering systems will not come from asking every developer to become a prompt whisperer. They will come from giving teams safe defaults, paved paths, observable behavior, reusable workflows, and enough room for senior engineers to adapt the system to real work.
The same principles we use for cloud platforms apply here. Reduce undifferentiated decisions. Make the secure path the easy path. Instrument the system. Preserve local autonomy where it matters. Build for learning, not just control.
That is where AI-assisted development gets interesting. Not in the demo where the agent writes code once. In the operating model where the system gets better every week.
What changes Monday
If you have read this far, you are probably weighing the cost of improving your setup against the cost of keeping the one you have. Here is the smallest version that returns most of the value. Five things, in order. Maybe two hours of work.
Start with context, then fast feedback, then memory, then repeatable rituals, then auditability. That order matters. You are building a system of control before you add a system of speed.
1. Write a CLAUDE.md. Keep it to roughly 200 lines. Include project conventions, file locations, gotchas, and what not to do. This single file delivers more value than any other piece of configuration because it gives every session a shared starting point.
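A skeleton to start from, with entries drawn from examples earlier in this post rather than a template you must follow:

```markdown
# CLAUDE.md

## Conventions
- Use bun, not npm. Run tests with `bun run test`, never `bun test`.
- Validate user input with `safeParse`, not `parse`.

## File Locations
- tasks/lessons.md — project gotchas; read at session start
- tasks/todo.md — in-flight work and the "Resuming From Here" section

## Never
- Never edit `.env*`, lockfiles, or migration files without explicit review.
```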
2. Add a fast typecheck hook.
{
  "hooks": {
    "Stop": [{
      "matcher": "",
      "hooks": [{
        "type": "command",
        "command": "bun typecheck 2>&1 | tail -10",
        "timeout": 60
      }]
    }]
  }
}
Replace bun typecheck with tsc --noEmit, mypy, cargo check, swift build, or whatever is fast in your stack. The point is to catch the obvious stuff before the conversation ends.
3. Add tasks/lessons.md and reference it from CLAUDE.md. When something bites you twice, add a line. Review it at the start of every session. This is the cheapest memory system you will ever build.
4. Build one custom command or skill for your most common ritual. For me, it was /qcheck. For you, it might be /release-notes, /db-migration, /triage-bug, or /incident-summary. Pick the thing you do every week and never want to re-invent again.
5. Add the audit-command.sh hook. Log every Bash command. You do not need to do anything sophisticated with the log on day one. You will be glad it exists the first time something breaks and you need to retrace what happened.
Or, if you want to skip the assembly, the full configuration is at github.com/vscarpenter/claude-code-build-system. The repo mirrors the .claude/ and ~/.claude/ layout you would set up locally, so you can clone, copy, and run.
That is enough for the first pass. No plugins, no MCP servers, no elaborate subagent fleet required. Once those five pieces are in place, the next thing to add will become obvious because you will feel its absence.
The bigger point is not the configuration. It is the operating model. Discipline does not scale. Automation does. Standards belong in files. Specialists beat generalists. Rituals deserve commands. Memory is a feature. Permissions are safety equipment.
Pick one. Start Monday.
Written April 2026. Configuration tested across thirteen months of daily Claude Code use on macOS. This is a solo-developer perspective; team-scale configuration adds governance concerns I am still working through. Hooks shown here use Bash. Windows users will need WSL or PowerShell equivalents.
Written by Vinny Carpenter
VP Engineering · 30+ Team Experience
I lead engineering teams building cloud-native platforms at a Fortune 100 company. I write about engineering leadership, AI-assisted development, platform strategy, and the hard lessons that come from shipping at scale.