How we stopped hoping our AI was right and started knowing it

The first version of our pipeline was fast, impressive, and wrong in ways we didn’t notice for weeks.

We’d given the model everything — campaign data, performance history, budget state — and asked it to decide. It decided confidently. Forty-two times, it decided wrong: inverted logic, overridden human choices, recommendations that looked plausible and weren’t. These weren’t hallucinations. They were what unbounded AI reasoning looks like in production.

Account managers trust it now. They didn’t before. The difference wasn’t a smarter model — it was a harness. Explicit gates where we used to trust the model. Evaluated outputs instead of “trust me bro, the AI got it.”


What went wrong

The pipeline generates campaign recommendations for hundreds of ad campaigns every night. At first, we treated the model like a smart analyst: give it everything, let it figure it out.

The failures were subtle. The model would see a performance metric that represented a floor (the minimum acceptable threshold) and reason about it as a ceiling. The resulting recommendation did the opposite of what the campaign needed. It would look at underspend and suggest removing targeting restrictions that an account manager had deliberately set. It would recommend adding keywords to a campaign already burning through its entire daily budget before noon.

None of this looked broken from the outside. The language was fluent. The logic was internally consistent. The model was just reasoning in a space where we’d never told it where the boundaries were.

Unbounded reasoning isn’t a model problem. It’s a design problem.

From unbounded to scoped

[Figure: Before: campaign data → LLM reasons and decides → recommendation. 42 failure incidents: inverted logic, ignored history. After: campaign data → code classifies (pre-solved) → LLM renders only → recommendation. Failure rate drops. System becomes auditable.]

What the harness means in practice

A harness — in the sense we use it here — is the set of constraints, gates, and checks that make AI output trustworthy rather than hopeful. It doesn’t replace AI reasoning. It defines exactly where that reasoning happens, and what happens to its output.

The pipeline today runs five distinct AI tasks. Each one has a defined job.

One model acts as a German SEO expert: given the ad titles for a campaign and real search-term demand data, it judges whether those ads represent one coherent buyer intent or several competing ones (a genuine classification decision), then generates a minimal, high-coverage keyword list grounded in what buyers actually search for. Output goes to the account manager, who can read it, question it, and override it.

Another task asks the model to assess keyword relevance: does this term actually fit what this advertiser sells? Output shown, override available.

A third receives a fully pre-solved diagnosis (budget state, performance verdict, historical outcomes of similar changes across hundreds of campaigns) and renders it into a plain-language recommendation.

One model classifies a dashboard screenshot with a single word: OK, LOADING, or EMPTY.
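The single-word screenshot classifier shows how narrow a scoped task can be. Here is a minimal sketch of how such an output might be checked before anything downstream relies on it. The names `DashboardState` and `parse_dashboard_state`, and the choice to tolerate whitespace and case, are assumptions for illustration, not the pipeline's actual code.

```python
from enum import Enum


class DashboardState(Enum):
    OK = "OK"
    LOADING = "LOADING"
    EMPTY = "EMPTY"


def parse_dashboard_state(raw: str) -> DashboardState:
    """Accept exactly one of the three allowed labels; reject anything else."""
    # Assumption: surrounding whitespace and letter case are harmless noise.
    label = raw.strip().upper()
    try:
        return DashboardState(label)
    except ValueError:
        # A full sentence, an extra word, or a new label is treated as a failure,
        # not silently mapped to the nearest valid state.
        raise ValueError(f"unexpected classifier output: {raw!r}") from None
```

A reply such as "The dashboard looks OK" is rejected rather than interpreted, which keeps the model's job limited to picking one of three words.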

Different tasks, different levels of AI involvement. The pattern is constant: scope defined in advance, output visible, human override always available.

Where the model kept wandering, we replaced it with deterministic code. Where its judgment genuinely adds value, we kept it, but added evals: checks that run before and after the model call and verify the output against criteria we defined before looking at any specific result. Criteria first. Output second. That’s the shift from hope to knowledge.
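One way to picture the split between deterministic code and model rendering is the sketch below. Everything in it is hypothetical: the `Diagnosis` fields, the spend-ratio thresholds, and the verdict labels are illustrative assumptions, not the rules the pipeline actually uses. The point is only that the decision is made in code and the model is asked to phrase it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    campaign_id: str
    budget_state: str  # decided by code
    verdict: str       # decided by code, never by the model


# Hypothetical mapping from budget state to a pre-solved verdict.
VERDICTS = {
    "exhausted": "do_not_add_keywords",
    "underspent": "review_targeting_with_account_manager",
    "on_track": "no_change",
}


def classify_budget(spend: float, daily_budget: float) -> str:
    """Deterministic classification; the 0.5 and 1.0 cut-offs are assumptions."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    ratio = spend / daily_budget
    if ratio >= 1.0:
        return "exhausted"
    if ratio < 0.5:
        return "underspent"
    return "on_track"


def diagnose(campaign_id: str, spend: float, daily_budget: float) -> Diagnosis:
    state = classify_budget(spend, daily_budget)
    return Diagnosis(campaign_id, state, VERDICTS[state])


def build_render_prompt(d: Diagnosis) -> str:
    """The model receives a finished verdict and is asked only to word it."""
    return (
        "Render the following diagnosis as a plain-language recommendation. "
        "Do not change the verdict.\n"
        f"campaign: {d.campaign_id}\n"
        f"budget_state: {d.budget_state}\n"
        f"verdict: {d.verdict}"
    )
```

Because the verdict is fixed before the prompt is built, a check after the model call can compare the rendered text against a known answer instead of judging it from scratch.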

The failure taxonomy — five failure categories, fifteen invariants, forty-two catalogued incidents — is the ledger of every time we learned what “wrong” looks like specifically enough to check for it automatically. The harness is accumulated scar tissue.

Two-tier evaluation

[Figure: Generated recommendation → Tier 1, automated checks (strict, structural + threshold-based). ✗ Blocked: rec not shipped. ✓ Passes → Tier 2, manual review (reviewed against failure taxonomy). New incident → taxonomy entry → invariant → system prompt update.]
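A Tier 1 gate of this kind can be sketched as a list of small check functions, each returning a reason when it blocks. The two structural rules below are taken from failures described in this post; the `Recommendation` fields, the function names, and the keyword limit of 50 are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Recommendation:
    campaign_id: str
    action: str                    # e.g. "add_keywords"
    budget_exhausted: bool
    touches_am_restrictions: bool  # would alter targeting an account manager set
    keyword_count: int = 0


Check = Callable[[Recommendation], Optional[str]]


def no_keywords_when_budget_exhausted(rec: Recommendation) -> Optional[str]:
    if rec.action == "add_keywords" and rec.budget_exhausted:
        return "adds keywords to a campaign that already spends its daily budget"
    return None


def never_override_am_restrictions(rec: Recommendation) -> Optional[str]:
    if rec.touches_am_restrictions:
        return "removes targeting restrictions set by an account manager"
    return None


def keyword_count_within_limit(rec: Recommendation, limit: int = 50) -> Optional[str]:
    # Threshold-based check; the limit is an illustrative assumption.
    if rec.keyword_count > limit:
        return f"proposes {rec.keyword_count} keywords, above the limit of {limit}"
    return None


TIER1_CHECKS: List[Check] = [
    no_keywords_when_budget_exhausted,
    never_override_am_restrictions,
    keyword_count_within_limit,
]


def tier1_gate(rec: Recommendation) -> List[str]:
    """Return every violation; an empty list means the rec moves to manual review."""
    violations = []
    for check in TIER1_CHECKS:
        reason = check(rec)
        if reason is not None:
            violations.append(reason)
    return violations
```

Collecting all violations, rather than stopping at the first, gives the manual reviewer the full picture when a blocked recommendation is inspected later.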

Who built it

The most important thing we learned wasn’t technical.

The account managers who act on the recommendations are the ones who catch the subtle failures. “This rec told me to add keywords to a campaign that’s already maxing its daily budget.” That message lands in Slack. A new incident gets opened. The incident becomes a taxonomy entry. The taxonomy entry becomes an invariant. The invariant gets encoded into the system prompt.
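The chain from Slack message to system prompt line can be sketched as plain data. This is an illustration under assumptions: the class names, the category label "budget_blindness", and the line format are hypothetical, and the real taxonomy is not shown in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Incident:
    reported_by: str
    message: str


@dataclass(frozen=True)
class TaxonomyEntry:
    category: str   # one of the failure categories (label here is hypothetical)
    incident: Incident
    invariant: str  # the rule, stated precisely enough to check


def append_invariant(prompt_lines: List[str], entry: TaxonomyEntry) -> List[str]:
    """Return a new prompt with the invariant added once; never mutate the input."""
    rule = f"- [{entry.category}] {entry.invariant}"
    if rule in prompt_lines:
        return list(prompt_lines)
    return [*prompt_lines, rule]


incident = Incident(
    reported_by="account manager",
    message="This rec told me to add keywords to a campaign "
            "that's already maxing its daily budget.",
)
entry = TaxonomyEntry(
    category="budget_blindness",
    incident=incident,
    invariant="Never recommend adding keywords when the daily budget is exhausted.",
)
prompt = append_invariant(["You render pre-solved diagnoses."], entry)
```

Keeping the original incident attached to the entry preserves the story behind each rule, so a later reader can see why a given prompt line exists.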

The account manager never wrote a line of code. They shaped the system as much as any engineer did.

The harness feedback loop

[Figure: Overnight: the pipeline runs, automated. Daylight: recommendation delivered to account manager → AM acts on rec (real campaigns, real budget) → spots failure ("wait, that doesn't look right") → incident logged (Slack → taxonomy entry) → invariant defined (new rule) → system prompt updated (150 lines, growing). The loop repeats every night.]

This is the actual structure of the work. The pipeline runs overnight. The discipline that makes it trustworthy is built in the daylight — in Slack threads, in incident reviews, in “wait, that doesn’t look right” moments from people with no formal obligation to investigate but who do anyway, because they’re trying to do their jobs well.

Every line in that 150-line system prompt has a story behind it. Most of those stories started outside the engineering team.

That changes the definition of “shipped.” The first version that ran in production wasn’t finished — it was instrumented. The real build happened after, in the feedback loop between the people generating recommendations and the people acting on them. The harness isn’t a document you write before launch. It’s what accumulates after.

This is what co-development actually looks like at the system level. Not a workshop. Not a shared roadmap. A loop: failure surfaces, failure gets named, name becomes rule, rule gets enforced. The people closest to the output drive the loop. The engineers encode what they catch.


The horse

There’s a reason “harness” is the right word for this.

A harness doesn’t slow a horse down. It channels the force. The horse is still running flat out — the power is still there. What the harness adds is direction, load-bearing structure, and the ability to actually use what the animal can do.

AI reasoning is like that. The model generates keyword lists a human analyst would take hours to produce, spots patterns across hundreds of advertisers simultaneously, classifies at a scale no team could match. None of that goes away. What the harness adds is the structure that makes it usable: scoped tasks, checked outputs, visible decisions.

In a previous post about the 5 levels of AI integration, I described L4 — “Harness-Driven” — as the level where AI operates in a closed feedback loop, with humans defining acceptance criteria. The unsolved problem I flagged: real L4 needs to be legible to someone other than its author. The failure taxonomy is the answer.

Not an elegant one. The one that works.

The harness accumulates from everyone who catches a mistake and names it precisely enough that it never happens again. The account managers who flag a bad recommendation in Slack, the engineers who encode that flag into an invariant, the system prompt that grows by one line: that’s the loop.

The goal was never to limit the AI. It was to stop hoping it was right and build something that could know.