Halving an LLM writing skill's cost with a deterministic pass

I caught myself using Fable 5, a frontier model, to check whether I’d put a comma in the right place. It worked. It also felt absurd. A regex has flagged passive voice and stray Oxford commas since 1995, and I had rented a frontier model to do the same job.

That moment turned a small side project into a real one. The result is a blog-draft auditing skill that does the same work for roughly half the tokens and half the time, with no drop in quality. Tokens fell from 55k to 27k, time from 276 seconds to 126, with the pass rate steady at 100%. This is how I got there, and the one habit I’d take to the next skill I build.

The actual problem: I’m not a writer

I build things I think are cool and then freeze when it’s time to write about them. I’m not an author, I don’t know how to blog, and the obvious fix (“just have the AI write it”) fails a different test: it comes out generic and inauthentic, and people can smell that. What I wanted was training wheels: a system that holds my writing to a real standard without replacing my voice.

So I built two small Claude Code skills that share one editorial standard:

project-narrative: reads a repo, interviews me one question at a time, and drafts a post. (It wrote this one.)
blog-draft-audit: takes a draft and checks it against the same standard, returning a verdict and a fix list.

The audit skill is where the real engineering happened.

The realization: most of “good writing” is pattern-matchable

The first version of the audit skill did everything in the model. Hand it a draft, and Fable would read every line looking for passive voice, marketese, run-on paragraphs, weak link text, number-style slips, all of it. That’s what made me wince: it’s expensive and slow to use a reasoning model as a linter.

So I did the step that mattered most. I ran a few end-to-end audits, then fed the trace history back into the model and asked: which of these findings could a dumb, deterministic script have caught instead?

The answer was: nearly the whole mechanical checklist. Sentence rhythm, passive constructions, wordiness, Oxford commas, acronym expansion, number formatting, link quality, image references, paragraph length, readability: every one of those is pattern-matchable. None of them needs judgment. They just need to be checked exhaustively and consistently, which is exactly what a model is bad at and a script is good at.

That became the design: a two-pass audit.

A deterministic pass: a standard-library Python script (audit.py) that catches everything mechanical, the same way every time.
A judgment pass: the model, now spending its tokens only on the work that actually needs a brain. Is the lede buried? Does the headline make a promise the intro keeps? Does the tone fit the audience?

Figure: the mechanical half moves left; the judgment stays in the model.
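Here is a minimal sketch of what a deterministic pass like this could look like, using only the standard library. The rule names, regexes, and word threshold are illustrative assumptions, not the contents of the actual audit.py, and the sketch assumes one paragraph per line. The passive-voice pattern is a crude heuristic that will produce false positives.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    line: int      # 1-based line number in the draft
    rule: str      # short rule identifier
    excerpt: str   # the text that triggered the rule


# Hypothetical heuristics: a "to be" verb followed by a word ending in -ed/-en.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|been|being|be)\s+\w+(?:ed|en)\b", re.IGNORECASE
)
# Hypothetical list of uninformative Markdown link labels.
WEAK_LINK = re.compile(r"\[(click here|here|this|link|read more)\]\(", re.IGNORECASE)
MAX_PARAGRAPH_WORDS = 120  # assumed threshold, not taken from the post


def audit(text: str) -> list[Finding]:
    """Run every mechanical rule over every line, the same way every time."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PASSIVE.finditer(line):
            findings.append(Finding(number, "passive-voice", match.group(0)))
        for match in WEAK_LINK.finditer(line):
            findings.append(Finding(number, "weak-link-text", match.group(1)))
        word_count = len(line.split())
        if word_count > MAX_PARAGRAPH_WORDS:
            findings.append(Finding(number, "long-paragraph", f"{word_count} words"))
    return findings


if __name__ == "__main__":
    draft = "The draft was reviewed by the model.\nSee [here](https://example.com) for details."
    for finding in audit(draft):
        print(f"L{finding.line} {finding.rule}: {finding.excerpt}")
```

The point of the shape is that each rule is a pure function of the text, so the findings list is identical on every run and can be handed to the judgment pass as input instead of being rediscovered by the model.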

What it bought

Moving the pattern-matchable work out of the model and into the script is where the savings came from:

	Before (iteration-1)	After (iteration-2)
Tokens (with skill)	55,007 ± 2,630	27,418 ± 3,057
Time	275.9s ± 32.3s	126.2s ± 19.5s
vs. baseline tokens	~3.4× (baseline 16,149)	~1.58× (baseline 17,400)
Pass rate	100%	100%

The short version: the deterministic analyzer roughly halved token cost and time while holding quality at 100%. Measured against a no-skill baseline, the skill’s overhead dropped from ~3.4× down to ~1.58×. The script runs in about a second and never gets tired or inconsistent on line 400 the way a model scanning prose does.
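For anyone who wants to check the arithmetic, the ratios can be reproduced from the mean values in the table. This is just a worked calculation; the dictionary and function names are my own for illustration.

```python
# Mean values copied from the table above (the ± spreads are ignored here).
before = {"tokens": 55_007, "seconds": 275.9, "baseline_tokens": 16_149}
after = {"tokens": 27_418, "seconds": 126.2, "baseline_tokens": 17_400}


def overhead(run: dict) -> float:
    """Skill token cost as a multiple of the no-skill baseline."""
    return run["tokens"] / run["baseline_tokens"]


def reduction(old: float, new: float) -> float:
    """Fractional drop going from old to new."""
    return 1 - new / old


print(f"overhead before: {overhead(before):.2f}x")   # 3.41x
print(f"overhead after:  {overhead(after):.2f}x")    # 1.58x
print(f"token reduction: {reduction(before['tokens'], after['tokens']):.1%}")    # 50.2%
print(f"time reduction:  {reduction(before['seconds'], after['seconds']):.1%}")  # 54.3%
```

Note that the two iterations were measured against slightly different baselines (16,149 vs. 17,400 tokens), so the overhead ratios are each relative to their own baseline run.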

The honest limit: judgment isn’t free

Here’s the part I won’t oversell. The split is not “deterministic vs. nice to have.” It’s “deterministic vs. irreducible.” The judgment checks stay in the model for a reason. No pattern captures “the headline doesn’t deliver,” and fact-checking a claim costs tokens with no way around it. That 1.58× overhead is the floor, not a number I expect to drive to 1.0×.

And the deterministic approach has a sharp edge I found the hard way. In one benchmark, the plain model baseline actually beat my skill.

The baseline caught a draft that claimed an “89% recovery rate” while the draft’s own failure taxonomy said otherwise, and it flagged a referenced search-architecture.png that didn’t exist on disk. My skill flattened both into a generic “verify before publishing.” Offload work onto a pattern-matcher and the system gets dumber at exactly the cross-cutting judgment a model does for free.

The fix wasn’t to make the script truth-aware. It was to be honest about the boundary: keep fact-checking the author’s job, and have the skill produce a claims-to-verify table. It organizes what to check; it doesn’t pretend to know what’s true. (The file-existence check, though, was mechanical, so that one went into the script.)
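Both halves of that boundary can be sketched in a few lines. This is an assumption-laden illustration, not the real script: it assumes the draft is Markdown with `![alt](path)` image syntax, it treats only percentages as claims, and the function names are hypothetical.

```python
import re
from pathlib import Path

IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")   # Markdown image: ![alt](path)
PERCENT_CLAIM = re.compile(r"\b\d+(?:\.\d+)?%")     # e.g. 89% or 12.5%


def missing_images(draft: str, base_dir: Path) -> list[str]:
    """Return local image paths referenced in the draft that are absent under base_dir."""
    missing = []
    for path in IMAGE_REF.findall(draft):
        if path.startswith(("http://", "https://")):
            continue  # remote images are out of scope for a disk check
        if not (base_dir / path).is_file():
            missing.append(path)
    return missing


def claims_to_verify(draft: str) -> list[tuple[int, str, str]]:
    """List (line number, claim, surrounding line) for each percentage in the draft.

    The script only collects claims; deciding whether they are true stays with the author.
    """
    rows = []
    for number, line in enumerate(draft.splitlines(), start=1):
        for match in PERCENT_CLAIM.finditer(line):
            rows.append((number, match.group(0), line.strip()))
    return rows
```

The first function gives a definite answer because file existence is a fact the script can observe. The second deliberately stops at listing, which is the boundary described above.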

Where this honestly stands

This is early, and I’d rather say so than have you find the gap yourself:

The audit skill is the measured half. The numbers above are real but mostly n=1: variance is barely characterized, and a proper three-runs-per-config benchmark is still on the backlog.
project-narrative, the skill that wrote this post, is unbenchmarked. It hasn’t been put through its own eval suite yet.
Two blog skills with adjacent triggers are a classic misfire setup. Whether “audit my draft” and “write a post about this repo” reliably route to the right one is untested.
And the original goal (does this actually help a non-writer produce something authentic?) I can’t put a number on yet. A real before/after on writing quality, not just audit cost, is the experiment I still owe.
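For the three-runs-per-config benchmark mentioned in the first item above, the summary step could be as small as this sketch. The function name is hypothetical and the sample numbers are made-up placeholders, not measurements from the repo.

```python
from statistics import mean, stdev


def summarize(samples: list[float]) -> str:
    """Format samples as 'mean ± sample standard deviation', flagging single runs."""
    if len(samples) < 2:
        # One run gives a point estimate but says nothing about spread.
        return f"{mean(samples):,.0f} (n={len(samples)}, spread unknown)"
    return f"{mean(samples):,.0f} ± {stdev(samples):,.0f} (n={len(samples)})"


if __name__ == "__main__":
    # Placeholder token counts for illustration only.
    runs = {"with_skill": [27_000, 28_000, 29_000], "baseline": [17_000]}
    for config, samples in runs.items():
        print(f"{config}: {summarize(samples)}")
```

Printing n next to every figure keeps a single-run result from reading like a characterized one.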

The takeaway I’d reuse

If I build another skill, the move I’d repeat isn’t the two-pass architecture. It’s the step that found it: run the skill end-to-end, then feed your own trace history back and ask what didn’t need the model. I didn’t guess the deterministic/judgment split up front. The traces showed me where the reasoning budget went: linter work. The optimization was hiding in my own logs.

That’s the cheap, repeatable habit. The frontier model is worth its price on the judgment calls. It’s just embarrassing to pay it for the comma.

The skills, the analyzer, and the benchmark history are in the repo. If you’ve built skills and have a sharper way to measure the writing-quality side, the part I still can’t quantify, I’d genuinely like to hear it.