How we stopped hoping our AI was right and started knowing it

Figure: Campaign recommendation pipeline dashboard showing structured AI output

The first version of our pipeline was fast, impressive, and wrong in ways we didn’t notice for weeks.

We’d given the model everything — campaign data, performance history, budget state — and asked it to decide. It decided confidently. Forty-two times, it decided wrong: inverted logic, overridden human choices, recommendations that looked plausible and weren’t. These weren’t hallucinations. They were what unbounded AI reasoning looks like in production.

Account managers trust it now. They didn’t before. The difference wasn’t a smarter model. It was a harness. Explicit gates where we used to trust the model. Evaluated outputs where we used to hope it was right.

What went wrong

The pipeline generates campaign recommendations for hundreds of ad campaigns every night. At first, we treated the model like a smart analyst: give it everything, let it figure it out.

The failures were subtle. The model would see a performance metric that represented a floor — the minimum acceptable threshold — and reason about it as a ceiling. Recommendation: do the opposite of what the campaign needed. It would look at underspend and suggest removing targeting restrictions that an account manager had deliberately set. It would recommend adding keywords to a campaign already burning through its entire daily budget before noon.
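The floor-versus-ceiling failure is the kind that can be taken out of the model's hands entirely. Here is a minimal sketch, assuming the direction of each threshold is stored alongside its value and the verdict is computed in code. The names `Bound`, `Threshold`, and `verdict`, and the example metric, are hypothetical illustrations, not the pipeline's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class Bound(Enum):
    FLOOR = "floor"      # observed value must stay at or above the limit
    CEILING = "ceiling"  # observed value must stay at or below the limit


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    bound: Bound


def verdict(threshold: Threshold, observed: float) -> str:
    """Return 'ok' or 'violated' without asking a model to infer direction."""
    if threshold.bound is Bound.FLOOR:
        return "ok" if observed >= threshold.limit else "violated"
    return "ok" if observed <= threshold.limit else "violated"


# Hypothetical example: a minimum acceptable return on ad spend.
roas_floor = Threshold(metric="roas", limit=2.0, bound=Bound.FLOOR)
print(verdict(roas_floor, 1.4))  # violated
print(verdict(roas_floor, 2.6))  # ok
```

With the direction made explicit, the model can be handed the verdict itself rather than a bare number it has to interpret.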

None of this looked broken from the outside. The language was fluent. The logic was internally consistent. The model was just reasoning in a space where we’d never told it where the boundaries were.

Unbounded reasoning isn’t a model problem. It’s a design problem.

Figure: From unbounded to scoped

What the harness means in practice

A harness — in the sense we use it here — is the set of constraints, gates, and checks that make AI output trustworthy rather than hopeful. It doesn’t replace AI reasoning. It defines exactly where that reasoning happens, and what happens to its output.

The pipeline today runs four distinct AI tasks. Each one has a defined job.

One task judges whether a campaign’s ads chase one buyer intent or several, then builds a keyword list from real search demand. One scores keyword relevance: does this term fit what this advertiser actually sells? One takes a fully pre-solved diagnosis — budget state, performance verdict, historical data — and puts it into plain language. One looks at a dashboard screenshot and answers with a single word: OK, LOADING, or EMPTY.
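The screenshot task shows what a narrow scope buys: with only three legal answers, the output can be checked mechanically. A minimal sketch of that check, assuming the model's reply arrives as a plain string. `ScreenState` and `parse_screen_state` are hypothetical names, and what to do with a rejected reply (retry, escalate) is left to the caller.

```python
from enum import Enum
from typing import Optional


class ScreenState(Enum):
    OK = "OK"
    LOADING = "LOADING"
    EMPTY = "EMPTY"


def parse_screen_state(raw: str) -> Optional[ScreenState]:
    """Map a model reply to a ScreenState, or None if it is off-vocabulary."""
    token = raw.strip().upper()
    try:
        return ScreenState(token)
    except ValueError:
        # Anything other than the three allowed words is rejected outright,
        # including fluent sentences that merely contain one of them.
        return None


print(parse_screen_state(" ok\n"))                    # ScreenState.OK
print(parse_screen_state("The dashboard looks OK."))  # None
```

Rejecting "The dashboard looks OK." is deliberate in this sketch: a reply outside the vocabulary is treated as a failed task, not as something to interpret.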

Every output lands in front of an account manager who can read it, question it, or override it. Different tasks, different levels of AI involvement. The pattern is constant: scope defined in advance, output visible, human override always available.

Where the model kept wandering, we replaced it with deterministic code. Where its judgment genuinely adds value, we kept it, but added evals: checks that run before the model call and after it, verifying the output against criteria we defined before we had looked at any specific result. Criteria first. Output second. That’s the shift from hope to knowledge.
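To make the before-and-after shape concrete, here is a rough sketch of a gated model call. It assumes two of the failures described earlier have been turned into output checks. Every name in it (`Campaign`, `Recommendation`, `pre_check`, `post_check`, `run_gated`) is hypothetical, and the real checks are surely richer than these.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Campaign:
    campaign_id: str
    daily_budget: float
    spend_today: float
    locked_settings: frozenset  # settings an account manager set on purpose


@dataclass(frozen=True)
class Recommendation:
    action: str  # e.g. "add_keywords", "remove_targeting"
    target: str  # the setting or entity the action touches


def pre_check(c: Campaign) -> List[str]:
    """Input-side gate: refuse to call the model on data that makes no sense."""
    problems = []
    if c.daily_budget <= 0:
        problems.append("daily budget missing or non-positive")
    if c.spend_today < 0:
        problems.append("negative spend")
    return problems


def post_check(c: Campaign, r: Recommendation) -> List[str]:
    """Output-side gate: criteria written before any output was seen."""
    violations = []
    if r.action == "add_keywords" and c.spend_today >= c.daily_budget:
        violations.append("adds keywords to a budget-capped campaign")
    if r.action == "remove_targeting" and r.target in c.locked_settings:
        violations.append("overrides a setting an account manager locked")
    return violations


def run_gated(
    c: Campaign, model: Callable[[Campaign], Recommendation]
) -> Tuple[Optional[Recommendation], List[str]]:
    """Return (recommendation or None, reasons it was blocked)."""
    problems = pre_check(c)
    if problems:
        return None, problems  # the model is never called
    rec = model(c)
    violations = post_check(c, rec)
    return (None, violations) if violations else (rec, [])
```

The `model` argument is any callable, so the gates can be exercised with a stub in tests, with no model call at all.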

The failure taxonomy — five failure categories, fifteen invariants, forty-two catalogued incidents — is the ledger of every time we learned what “wrong” looks like specifically enough to check for it automatically. The harness is accumulated scar tissue.

Figure: Two-tier evaluation

Who built it

The most important thing we learned wasn’t technical.

The account managers who act on the recommendations are the ones who catch the subtle failures. “This rec told me to add keywords to a campaign that’s already maxing its daily budget.” That message lands in Slack. A new incident gets opened. The incident becomes a taxonomy entry. The taxonomy entry becomes an invariant. The invariant gets encoded into the system prompt.
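That chain from report to prompt line can be sketched as a small ledger. This is an illustration under assumptions, not the team's tooling: the `Invariant` record, the `record_incident` and `render_prompt_rules` helpers, the `INV-BUDGET-CAP` identifier, and the wording of the rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Invariant:
    invariant_id: str
    category: str  # one of the taxonomy's failure categories
    rule: str      # the sentence that goes into the system prompt
    incidents: List[str] = field(default_factory=list)  # reports behind it


def record_incident(ledger: List[Invariant], invariant_id: str,
                    category: str, rule: str, report: str) -> Invariant:
    """Attach a report to an existing invariant, or create a new one."""
    for inv in ledger:
        if inv.invariant_id == invariant_id:
            inv.incidents.append(report)
            return inv
    inv = Invariant(invariant_id, category, rule, [report])
    ledger.append(inv)
    return inv


def render_prompt_rules(ledger: List[Invariant]) -> str:
    """Render every invariant as one numbered line of prompt text."""
    return "\n".join(f"{i}. {inv.rule}" for i, inv in enumerate(ledger, 1))


ledger: List[Invariant] = []
record_incident(
    ledger,
    "INV-BUDGET-CAP",  # hypothetical identifier
    "budget",
    "Never recommend adding keywords once daily spend has reached the daily budget.",
    "Slack report: rec said to add keywords to a campaign already maxing its budget.",
)
print(render_prompt_rules(ledger))
# 1. Never recommend adding keywords once daily spend has reached the daily budget.
```

Keeping the original reports attached to each rule means anyone reading a prompt line can trace it back to the incident that produced it.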

The account manager never wrote a line of code. They shaped the system as much as any engineer did.

Figure: The harness feedback loop

This is the actual structure of the work. The pipeline runs overnight. The discipline that makes it trustworthy is built in the daylight — in Slack threads, in incident reviews, in “wait, that doesn’t look right” moments from people with no formal obligation to investigate but who do anyway, because they’re trying to do their jobs well.

Every line in that 150-line system prompt has a story behind it. Most of those stories started outside the engineering team.

That changes the definition of “shipped.” The first version that ran in production wasn’t finished — it was instrumented. The real build happened after, in the feedback loop between the people generating recommendations and the people acting on them. The harness isn’t a document you write before launch. It’s what accumulates after.

This is what co-development actually looks like at the system level. Not a workshop. Not a shared roadmap. A loop: failure surfaces, failure gets named, name becomes rule, rule gets enforced. The people closest to the output drive the loop. The engineers encode what they catch.

The horse

There’s a reason “harness” is the right word for this.

A harness doesn’t slow a horse down. It channels the force. The horse is still running flat out — the power is still there. What the harness adds is direction, load-bearing structure, and the ability to actually use what the animal can do.

AI reasoning is like that. The model generates keyword lists a human analyst would take hours to produce, spots cross-campaign patterns across hundreds of advertisers simultaneously, classifies at a scale no team could match. None of that goes away. What the harness adds is the structure that makes it usable — scoped tasks, checked outputs, visible decisions.

In a previous post about the 5 levels of AI integration, I described L4 — “Harness-Driven” — as the level where AI operates in a closed feedback loop, with humans defining acceptance criteria. The unsolved problem I flagged: real L4 needs to be legible to someone other than its author. The failure taxonomy is the answer.

Not an elegant one. The one that works.

The goal was never to limit the AI. It was to stop hoping it was right and build something that could know.