The first useful thing my publishing verifier did was embarrass me.
Not publicly, thankfully. It caught me before I published a technical claim I had read six times and stopped seeing. I had written a memory requirement for a local model as if it were a simple property of the model itself, instead of something shaped by model size, quantization, context window, runtime, and machine setup.
The sentence felt precise because it had a number in it. That is the dangerous part. A number can make a weak claim look like a receipt.
The verifier made me choose: source the details cleanly, rewrite the claim with the right constraints, or remove it. I wanted to override the flag because the post was done, the argument was working, and I was tired of sanding the same paragraph. Instead, I checked the claim.
The verifier was right.
That was annoying, useful, and exactly the point.
The uncomfortable part is that the error did not survive because I was careless. It survived because I was the reviewer, and I had already read the sentence too many times. By the third pass, I was not really reading anymore. I was recognizing. Every writer knows this failure mode, and every engineering leader has seen the code review version of it: the large pull request that gets a fast approval because the reviewer sees the shape of correctness instead of the details.
For years, my defense against shipping something wrong was attention. Mine, or a colleague's. Check the claims. Check the links. Check the structure. Check the metadata. Check the voice. Check whether the whole thing still says what I think it says.
That defense was always a process running in someone's head. It was unversioned, undocumented, and enforced only when the reviewer happened to be sharp, caffeinated, and not rushing to hit publish before a Monday morning meeting.
So I did what I would do with any process that fails under load. I stopped trusting the human to run it perfectly and encoded the judgment into a gate.
I built a verifier that reviews every post before it ships. This post went through it too.
Proof of Work is a series where I build the argument instead of only making it. Each post documents something I actually shipped, what broke along the way, and what it taught me about how software gets made now.
The bottleneck moved from generation to verification
The industry is spending enormous energy trying to make AI generation more useful. Better models, better prompts, better context, better retrieval, better agents. All of that helps. I use all of it.
But it does not solve the hardest part of the problem.
When generation was expensive, quality control lived partly inside the act of creation. A senior engineer writing code slowly and carefully was part of the quality system. Code review existed as a backstop: a second set of eyes, a forcing function for documented reasoning, and a gate that could say no.
That model worked because the volumes matched. Humans produced work at human speed, and humans reviewed it at human speed.
Agents broke that math.
Generation is now cheap enough and fast enough that review becomes the constraint. Teams adopting AI tooling discover the pattern quickly. The hard part is no longer producing more code, more tests, more documentation, more design options, or more analysis. The hard part is verifying that the work is correct, safe, useful, maintainable, and aligned with intent.
Piling more human attention onto the problem does not scale. It recreates the four-minute code review at higher volume and calls it governance.
The answer is not less human judgment. The answer is more durable human judgment.
That means taking the review process out of someone's head and turning it into an artifact that can run every time, at full sharpness, even when the human is tired, rushed, or too close to the work to see it clearly.
I have been arguing this in the abstract for a while. The AI Operating Ledger, my framework for measuring what enterprise AI actually returns, says the metric that matters is value per verified outcome. That sounds good on a slide. It becomes much more uncomfortable when you realize you cannot measure verified outcomes without a verifier.
At some point, you have to stop describing the gate and build one.
So I built one for the production system I ship to most often: this blog.
What the verifier actually does
The verifier is a Claude Code skill that runs against the final markdown file before publication. A skill, in this context, is a reusable instruction package that Claude Code can invoke consistently. Mine acts less like a chatbot and more like a review harness.
It does not ask what kind of feedback I want. It does not rewrite the post. It does not try to make the prose warmer, smarter, punchier, or more likely to perform on LinkedIn, which is good because the internet has enough confident beige rectangles already.
It runs a review pass and produces a structured report.
The workflow is simple. A draft lands in the blog repo. Before publication, the verifier runs against the final markdown file. It returns blocking issues, warnings, and notes. A blocker requires either a fix or a written override. If I override the flag, the rationale stays with the post.
No vibes. No "can you make this better?" No conversational drift where the model slowly becomes my co-author, therapist, and brand strategist in the same window.
The draft clears the gate, or it does not.
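The gate logic described above is small enough to sketch. This is an illustration in Python, not the skill itself, which is a Claude Code instruction package. The class names, field names, and check names are assumptions made for the example, and treating a sentence fragment as a blocker is only there to show the override path.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"
    WARNING = "warning"
    NOTE = "note"


@dataclass
class Finding:
    check: str                     # hypothetical check name
    severity: Severity
    message: str
    override_rationale: str = ""   # written by the author, never by the verifier


@dataclass
class Report:
    findings: list[Finding] = field(default_factory=list)

    def open_blockers(self) -> list[Finding]:
        """Blockers that are still present and have no written override."""
        return [
            f for f in self.findings
            if f.severity is Severity.BLOCKER and not f.override_rationale.strip()
        ]

    def clears_gate(self) -> bool:
        """Warnings and notes never block; only open blockers do."""
        return not self.open_blockers()


report = Report([
    Finding("sentence-fragment", Severity.BLOCKER, "Fragment in opening section."),
    Finding("long-sentence", Severity.WARNING, "Sentence exceeds length guideline."),
])
before = report.clears_gate()

# The author overrides the blocker in writing; the rationale stays on the finding.
report.findings[0].override_rationale = (
    "Intentional fragment for rhythm after the opening anecdote."
)
after = report.clears_gate()
```

The only two ways past a blocker in this sketch are removing the finding (a fix) or attaching a non-empty rationale (an override), which mirrors the rule in the workflow.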
The verifier checks four categories.
Claims and sources
Every statistic, date, named event, quoted figure, pricing claim, product claim, and technical assertion gets inspected. The verifier extracts the specific claim and checks whether the post gives it enough support.
If I cite a layoff number, a model release, a cloud pricing change, a benchmark, a product capability, or a hardware requirement, the verifier asks a basic question: can this claim be traced?
Claims it cannot verify do not get silently removed. They get flagged.
That distinction matters. The verifier does not own the essay. I do. But it makes casual confidence more expensive, and that is the feature.
A weak claim can still survive if I choose to keep it, but it cannot survive by accident.
Voice drift
My essays often come together across multiple writing sessions, and AI is usually somewhere in the loop. Stitched-together writing has a tell. Sentence rhythm flattens, transitions get too clean, and phrases appear that sound like someone wearing my jacket.
The verifier checks for that.
It compares sentence rhythm against my usual writing, scans for a banned-phrase list I maintain, flags filler, and catches generic AI phrasing before it settles into the draft. It also hunts down em dashes, which I have banned from my own writing because apparently I needed a constitutional amendment for punctuation.
I am aware of the irony of building an AI-assisted system partly to stop AI from making me sound like AI.
The irony does not make the check less useful.
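The two mechanical parts of this check, the banned-phrase scan and the em dash hunt, reduce to string matching. A minimal sketch in Python, assuming a plain-text draft. The phrases listed are hypothetical stand-ins, since the actual banned list is not reproduced in this post, and the rhythm comparison is left out because it needs a writing sample to compare against.

```python
EM_DASH = "\u2014"  # written as an escape so this snippet passes its own check

# Hypothetical entries; stand-ins for the real banned-phrase list.
BANNED_PHRASES = ("in today's fast-paced world", "game changer")


def find_em_dashes(text: str) -> list[int]:
    """Return 1-based line numbers that contain an em dash."""
    return [
        n for n, line in enumerate(text.splitlines(), start=1)
        if EM_DASH in line
    ]


def find_banned_phrases(
    text: str, phrases: tuple[str, ...] = BANNED_PHRASES
) -> list[tuple[int, str]]:
    """Return (line number, phrase) pairs, matched case-insensitively."""
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                hits.append((n, phrase))
    return hits


draft = "The gate works.\nThis is a Game Changer" + EM_DASH + "really.\n"
dash_lines = find_em_dashes(draft)
phrase_hits = find_banned_phrases(draft)
```

Both functions report locations and nothing else. They do not edit the draft, which keeps them consistent with a verifier that flags instead of fixing.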
Structural seams
A post drafted in three sessions will often have three openings hiding inside it. Sometimes the argument restarts halfway through. Sometimes a section restates the thesis instead of advancing it. Sometimes a transition looks polished but does not actually bridge the gap.
The verifier looks for those seams.
This check came directly from assembling my book manuscript, where session boundaries sometimes showed up as visible scar tissue in the text. The verifier is not perfect at this, and I will get to that in a minute. But it catches enough to stay in the suite.
It does not replace editorial judgment. It makes the places that need editorial judgment harder to miss.
Style enforcement
AP style. Metadata completeness. Link hygiene. Image alt text. Passive constructions. Long sentences. Weak verbs. Banned punctuation. Repeated phrases. Missing descriptions. The boring checks.
Boring checks are the ones humans skip first, which is exactly why they belong in the gate.
A good verifier does not need to be glamorous. In fact, if it starts feeling glamorous, I probably built the wrong thing. The job is not to impress me. The job is to stop predictable defects from reaching production just because I had a confident Sunday night.
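Two of the boring checks, image alt text and long sentences, show how little code a predictable defect needs. This is a rough sketch in Python, assuming standard inline Markdown image syntax. The 40-word threshold and the function names are assumptions, and the sentence splitter is deliberately naive.

```python
import re

MAX_WORDS = 40  # hypothetical threshold; no specific limit is stated in the post

# Inline Markdown image: ![alt](target "optional title")
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def images_missing_alt(markdown: str) -> list[str]:
    """Return targets of images whose alt text is empty or only whitespace."""
    return [
        m.group(2) for m in IMAGE.finditer(markdown)
        if not m.group(1).strip()
    ]


def long_sentences(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return sentences longer than max_words, splitting after . ! or ?"""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


sample = "![](chart.png) and ![A labeled chart](ok.png)"
missing = images_missing_alt(sample)

prose = "Short one. " + " ".join(["word"] * 45) + "."
too_long = long_sentences(prose)
```

A splitter this simple will stumble on abbreviations and decimals, so its findings fit better as warnings than as blockers.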
Three design decisions that mattered most
Every build has a handful of choices that quietly determine whether the thing earns its keep. These were mine.
A skill, not a checklist
I had a publishing checklist for a year. It lived in a markdown file, and I followed it the way everyone follows checklists: completely, until the week I was busy.
A checklist is an intention. A skill is enforcement.
The difference is the same one that separates a coding standards wiki page from a linter wired into CI. I have spent my career watching that difference decide outcomes. Standards that depend on memory become optional under pressure. Standards that run every time become part of the system.
That holds at the scale of one person publishing essays. It also holds at the scale of an enterprise trying to ship safely with more automation than it had last year.
The moment the review became executable, the standard changed from something I meant to do into something the system required me to confront.
The verifier never touches the generator
The verifier runs as a separate pass, with separate instructions and separate permissions. It is forbidden from rewriting the post. It reports. It does not fix.
This was deliberate, and I would defend it as the most important line in the design.
An author cannot grade their own homework, and neither can the model instance that helped draft the work. If the same system generates, reviews, and rewrites the artifact, you do not have independent judgment. You have one system with two moods.
The separation is not perfect. The verifier may still use the same general class of model that helped with the draft, so I am not pretending this is mathematical independence. The value comes from role separation: different instructions, different permissions, structured output, source tracing, no rewrite authority, and an explicit handoff back to the human.
That boundary is what makes the report worth reading.
It also preserves authorship. I do not want a system that quietly turns every critique into a rewrite. That feels efficient, but it blurs the most important question: who decided this was good enough to ship?
In this workflow, the answer is still me.
The gate informs the decision. It does not launder responsibility.
Overrides require a written rationale
I can ship past a flag, but I have to say why.
That felt like bureaucratic theater when I added it. It turned out to be the feature.
Writing "intentional fragment for rhythm" is easy. Writing "I am pretty sure this number is right" is hard, because typing that sentence makes you go check the number.
Here is a good override:
FLAG: Sentence fragment.
DECISION: Override.
RATIONALE: Intentional fragment for rhythm after the opening anecdote.
That documents intent.
Here is a bad override:
FLAG: Unverified statistic.
DECISION: Override.
RATIONALE: I believe this number is correct.
That is not a rationale. That is a confession with better formatting.
The override log became a feedback loop. Recurring strong rationales tell me which checks are too rigid. Recurring weak rationales tell me where I am trying to route around my own system.
Anyone who has run a change-management process at enterprise scale knows that feeling. The gate did its job. The actor went looking for a side door.
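The feedback loop depends on being able to count overrides per flag. A minimal sketch in Python, assuming the log is stored as plain text with blank-line-separated entries in the FLAG / DECISION / RATIONALE shape shown above. The storage format is an assumption, and the third entry in the sample is invented to show a recurring override.

```python
from collections import Counter


def parse_override_log(text: str) -> list[dict[str, str]]:
    """Parse blank-line-separated entries of 'KEY: value' lines into dicts."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip().upper()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def overrides_by_flag(entries: list[dict[str, str]]) -> Counter:
    """Count how often each flag was overridden, ignoring trailing periods."""
    return Counter(
        e["FLAG"].rstrip(".")
        for e in entries
        if "FLAG" in e and e.get("DECISION", "").rstrip(".").lower() == "override"
    )


LOG = """FLAG: Sentence fragment.
DECISION: Override.
RATIONALE: Intentional fragment for rhythm after the opening anecdote.

FLAG: Unverified statistic.
DECISION: Override.
RATIONALE: I believe this number is correct.

FLAG: Sentence fragment.
DECISION: Override.
RATIONALE: Intentional fragment for rhythm.
"""

counts = overrides_by_flag(parse_override_log(LOG))
```

The count only shows which flags recur. Deciding whether the rationales behind them are strong or weak is still a reading job for the human.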
What I left out matters too.
No auto-publish. No auto-fix. No score predicting whether the essay will perform. No attempt to grade the post's importance. The verifier checks whether the post meets the standard I defined. Whether the standard is good enough is still my job.
I want a quality boundary, not a ghostwriter with opinions.
Where this pattern can go wrong
This section exists because every AI post that skips the failure mode should be read with suspicion, including mine.
The verifier helped immediately. It also failed immediately.
False positives are real
My first version flagged intentional choices constantly. Short fragments I use for emphasis. A deliberately long sentence meant to mimic the run-on feeling of a bad meeting. Repetition used for rhythm. Simple words that looked too simple to the checker.
The verifier read style as drift because I had defined my own voice too narrowly.
That was humbling in a useful way. Encoding judgment means discovering that some of your judgment was fuzzier than you believed. The checks I could not specify crisply were the ones I had never actually thought through.
That is not a reason to avoid the gate. It is a reason to tune it.
Not every check deserves blocking authority
The em dash hunt works because it is binary. Either the character is present, or it is not.
The structural seam check is different. "This transition papers over a gap" is a judgment call dressed up as a rule, and the verifier cannot always know whether a repetition is accidental or rhetorical.
That distinction matters.
Some checks should block automatically. Broken links, missing metadata, unsourced statistics, and malformed frontmatter are good candidates. Other checks should surface suspicion and force review. Treating every concern like a compiler error is how you build a system people learn to ignore.
That is the first counterargument to the verifier pattern: gates can become bureaucracy with better branding.
The answer is not to put gates everywhere. The answer is to put gates where risk, volume, and repeatability justify them, then measure whether each gate improves outcomes. A verifier that creates noise without reducing defects is not governance. It is theater with a YAML file.
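The split between checks that block and checks that only surface suspicion can live in one explicit table. A small sketch in Python: the blocking set mirrors the four candidates named above, while the identifiers themselves and the "review" bucket name are assumptions.

```python
# Identifiers are hypothetical; the categories come from the list above.
BLOCKING_CHECKS = {
    "broken-link",
    "missing-metadata",
    "unsourced-statistic",
    "malformed-frontmatter",
}


def triage(raised_checks: list[str]) -> dict[str, list[str]]:
    """Split raised checks into automatic blockers and items for human review."""
    result: dict[str, list[str]] = {"block": [], "review": []}
    for name in raised_checks:
        bucket = "block" if name in BLOCKING_CHECKS else "review"
        result[bucket].append(name)
    return result


raised = ["broken-link", "structural-seam", "unsourced-statistic", "voice-drift"]
decision = triage(raised)
```

Anything not explicitly granted blocking authority defaults to review. That default keeps a new, untuned check from behaving like a compiler error on its first day.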
The gate can create false confidence
A verifier can miss things. It can approve work that still contains a bad assumption, a weak argument, a brittle design, or a subtle security issue. Worse, it can make the team feel safer than it is.
That is the second counterargument, and it is the serious one.
A gate is not proof of truth. It is proof that a defined standard ran. Those are different things.
The right conclusion is not "the verifier passed, so the artifact is good." The right conclusion is "the verifier passed the checks we decided mattered enough to encode." That sentence is less magical and much more useful.
This is why experts still matter. The system can enforce known standards. It cannot eliminate the need for people who know when the standard is incomplete, stale, or pointed at the wrong risk.
The gate can encode one person's taste as law
This risk is small on a personal blog and much larger inside an engineering organization.
If I build a verifier for my own writing, it should encode my standards. If a platform team builds a verifier for hundreds of engineers, it had better encode more than one leader's preferences. Otherwise, "quality" becomes a polite word for "the way I like things."
That is how governance gets brittle.
Good gates need ownership, feedback loops, versioning, exception paths, and a way for practitioners to challenge the standard. The goal is not to freeze judgment. The goal is to make judgment inspectable enough that it can improve.
That is the healthy version of encoded judgment.
The leadership translation
Swap "blog post" for "pull request," "service deployment," "architecture decision," "incident review," "runbook," or "client-facing model output," and this is the problem every engineering organization is running into.
Agents have made generation abundant. Verification capacity has not moved nearly as fast.
Leaders keep trying to close the gap with review headcount, approval meetings, and escalation paths. Sometimes those are necessary, especially in high-risk domains. But if the work volume is rising because generation got cheaper, attention-based quality control becomes the next bottleneck.
The verifier pattern is the alternative.
Pull the judgment out of your senior people's heads. Encode it as durable artifacts. Let the review harness, not the reviewer's calendar, enforce the baseline.
That does not make experts less important. It makes them more important in a better way.
Your best people should not spend most of their time inspecting every artifact by hand. They should define the standards, tune the gates, review the exceptions, and improve the system. They move from being the quality control department to being the authors of the quality system.
That is a better use of expertise, and it is the only version of this that scales.
This is the same idea as pipeline-first delivery applied at a different altitude. We remove direct production access not because we distrust engineers, but because routing change through an enforced pipeline makes quality structural instead of personal.
Building the publishing verifier was me removing my own production access.
I no longer ship to my blog directly. I ship through a gate I designed, and the gate does not care how confident I feel on a Sunday night.
Trust the gate, not the actor.
Even when the actor is you.
Especially then.
What changes on Monday morning
The practical move is not to build a giant verifier for everything. That is how good ideas become enterprise wallpaper.
Start smaller.
Pick one artifact that is high-volume, high-friction, or increasingly AI-assisted. Pull requests are the obvious candidate, but they are not the only one. Architecture decision records, release notes, test plans, incident reviews, cloud change requests, threat models, runbooks, and client-facing AI outputs are all better candidates than most teams realize.
Then ask five questions.
First, what defects do we keep catching late?
Second, what judgment are senior people applying repeatedly but informally?
Third, what parts of that judgment can be made explicit enough to run every time?
Fourth, which failures should block the work, and which should only require human review?
Fifth, what does a valid override look like?
That last question is where the system becomes real. A gate without an override path becomes brittle. A gate with lazy overrides becomes decorative. The useful middle is a system that allows exceptions but makes weak exceptions visible.
You do not need perfection to start. In fact, you will not get perfection by starting. The first version will be noisy. Some checks will be too rigid. Some will be too vague. Some will reveal that the team never actually agreed on the standard it thought everyone understood.
That discovery is not failure. That is the work.
The verifier is not just checking the artifact. It is testing whether the organization can describe what good looks like.
Judgment as a durable artifact
The argument I keep making, in the book and on this blog, is that the bottleneck is never the stack. It is clarity of intent, encoded in durable artifacts that survive contact with speed, scale, and fatigue.
For most of my career, my editorial judgment was not a durable artifact. It was a mood with a track record. On good days, that was enough. On rushed days, it was a liability wearing a confident expression.
Now the standard is more concrete. It is a versioned file, a repeatable review, and an override log that makes my decisions visible to myself. It is not perfect, but it is inspectable. That alone makes it better than trusting the best version of me to show up every time.
That is the broader lesson.
Quality cannot depend on the best version of the human arriving at the exact moment the system needs judgment. Not in writing. Not in code review. Not in architecture. Not in agentic delivery.
The work now is to decide which judgments matter enough to encode, which exceptions matter enough to review, and which gates deserve the authority to say no.
That is where leadership moves next.
Not away from human judgment.
Toward judgment that survives speed.
