June 1, 2026

[Kevin: swap in a real eval example from your own build on review]

# How You Decide an AI System Is Good Enough to Ship When "Wrong" Means a Denied Prescription

Most work in deploying AI systems is deciding when to trust them, not building models. The actual building part? That's the easy bit now.

Different applications have different stakes. A chatbot recommends the wrong pizza topping? Who cares. A prior authorization system denies someone their medication? That's a real problem. Or it approves something it shouldn't, and now you're explaining improper approvals to regulators.

The challenge is knowing how good your model is on cases that actually show up. Not on your clean test data. On the messy, contradictory, scanned-sideways PDFs that real people submit.

Most teams fail because their evaluation methods flatter the system. They test on examples that look like what they trained on. Then they ship.

## The Trap: Demo Passes but System Is Broken

You've seen this pattern. Team builds an extraction pipeline. Maybe it's pulling diagnoses from clinical notes or coverage details from payer documents. They test it on a few examples, eyeball the output, looks good. Ship it.

Sometimes they get fancy and use the model to grade itself. "Hey GPT-4, did you extract this correctly?" And GPT-4 says yes 98% of the time. Because of course it does.

One month later the numbers don't reconcile. Someone does a spot check on 50 real cases. Finds 3 that are confidently wrong. That's a 6% error rate on critical healthcare decisions. Nobody caught it because nobody looked at the right 50 cases.
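To put a rough number on how little 50 cases can tell you, here is a minimal sketch that computes a 95% Wilson score interval for the 3-in-50 spot check above. The function name `wilson_interval` is my own, and the choice of the Wilson interval (rather than some other method) is an assumption for illustration, not something the spot check itself prescribes.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate of `errors` out of `n` checked cases."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# The spot check from the text: 3 confidently wrong answers in 50 real cases.
low, high = wilson_interval(errors=3, n=50)
print(f"observed 6.0%, plausible range {low:.1%} to {high:.1%}")
```

With only 50 cases the plausible range runs from roughly 2% to roughly 16%, so the spot check tells you there is a problem but not how big it is.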

Why does this happen? Demos use clean inputs. Real documents are scanned at angles, have coffee stains, contradict themselves across pages. The difference between demo data and production data is where systems die.

The actual deliverable isn't the model. It's the evaluation system that tells you whether the model is safe to use.

## Worked Example: Catching the 5% That Fails Silently

Let's say you're extracting whether a drug requires step therapy from payer coverage documents. Get this wrong and someone either can't fill their prescription or gets improper approval.

### Build Golden Set by Hand from Real Inputs

Pull 200 actual coverage documents. Not PDFs from the payer's website. The actual documents your system will see. Include the ugly ones. Scanned sideways. Faxed three times. Contradictory clauses on different pages.

Get a domain expert. Someone who actually does prior authorizations. Have them label each document:

- Does this drug require step therapy? Yes/no.
- Which specific clause proves your answer? Quote it.
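One way to store those two labels is a small record per document, plus a check that the quoted clause really appears in the document text. This is a sketch only: `GoldenCase`, `normalize`, `clause_is_verbatim`, and the sample text are all hypothetical names and data I made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    doc_id: str
    requires_step_therapy: bool  # the expert's yes/no label
    supporting_clause: str       # verbatim quote that proves the label

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't break the match."""
    return " ".join(text.lower().split())

def clause_is_verbatim(case: GoldenCase, document_text: str) -> bool:
    """True if the expert's quoted clause actually occurs in the document."""
    return normalize(case.supporting_clause) in normalize(document_text)

# Made-up document text and label, for illustration only.
document_text = """Coverage criteria for Drug B.
Member must have tried and failed Drug A
before Drug B is approved."""
case = GoldenCase(
    doc_id="doc-001",
    requires_step_therapy=True,
    supporting_clause="Member must have tried and failed Drug A before Drug B is approved.",
)
print(clause_is_verbatim(case, document_text))
```

Checking the quote against the source text catches labels where the expert paraphrased instead of quoting, which matters later when you grade citations.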

This is slow. This is boring. This is the entire ballgame.

200 hand-labeled cases means many unglamorous afternoons of reading insurance documents. But it's the only way to earn the right to claim a number about how well your system works.

### Write Grading Rubric That Doesn't Let Model Off Easy

Three-part rubric:

- Did it get the answer right?
- Did it cite the correct clause as the reason?
- Did it abstain when the document is genuinely ambiguous?

Getting the right answer for the wrong reason means it'll fail unpredictably next time. If it says "no step therapy required" but cites the wrong section, that's a failure. Even if the answer happens to be right.

Confident guessing on ambiguous documents is more dangerous than routing to a human. Build in points for appropriate abstention.
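The rubric can be sketched as a grading function. Everything here is an assumption for illustration: the `Label` and `Prediction` types, using `None` to mean "ambiguous" or "abstained", and the crude substring rule for matching clauses. A real grader would likely need a more forgiving clause match.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Label:
    requires_step_therapy: Optional[bool]  # None = expert judged the document ambiguous
    supporting_clause: str

@dataclass(frozen=True)
class Prediction:
    requires_step_therapy: Optional[bool]  # None = model abstained
    cited_clause: str

def _norm(text: str) -> str:
    return " ".join(text.lower().split())

def grade(label: Label, pred: Prediction) -> dict:
    ambiguous = label.requires_step_therapy is None
    abstained = pred.requires_step_therapy is None
    decided = not ambiguous and not abstained

    answer_ok = decided and pred.requires_step_therapy == label.requires_step_therapy
    gold = _norm(label.supporting_clause)
    # Crude rule (assumption): the expert's clause must appear inside the model's citation.
    clause_ok = decided and gold != "" and gold in _norm(pred.cited_clause)
    # Abstain exactly when the expert called the document ambiguous.
    abstention_ok = ambiguous == abstained

    # Right answer with the wrong clause does not pass.
    passed = abstention_ok and (ambiguous or (answer_ok and clause_ok))
    return {"answer_ok": answer_ok, "clause_ok": clause_ok,
            "abstention_ok": abstention_ok, "passed": passed}

label = Label(True, "Step therapy required: trial of Drug A before Drug B.")
lucky = Prediction(True, "Quantity limit: 30 tablets per 30 days.")
print(grade(label, lucky))  # right answer, wrong reason
```

Note one design choice: this sketch also fails a model that abstains on a clear document. That is stricter than the rubric strictly requires, and you may prefer to track unnecessary abstentions as a separate, cheaper kind of miss.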

### Run It and Read Failures, Not Score

First run: 94%. Feels good, right?

Wrong. 94% means 6 out of every 100 prescriptions get the wrong answer. At any real volume, that adds up to thousands of people.

You need to understand which 6%. On 200 cases that's 12 failures, so read all 12 by hand. What went wrong?

In this example, 11 out of 12 failures were cases where step therapy was required but the requirement was in a footnote or cross-reference. The model read the main body, said "no step therapy," and missed the fine print. Just like a rushed human would.

That's a pattern, not noise. Completely invisible if you just look at "94% accuracy."
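A sketch of how that pattern shows up once each failure has a tag. The tag names are hypothetical; the counts mirror the 11-of-12 split in this example. The tags themselves come from a person reading each failure, not from code.

```python
from collections import Counter

# Hypothetical tags a reviewer assigned while reading each of the 12 failures.
failure_tags = ["footnote_or_cross_reference"] * 11 + ["other"]

counts = Counter(failure_tags)
total = len(failure_tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{total}")
```

The code is trivial on purpose. The work is in the reading and tagging; the tally just makes the cluster impossible to miss.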

### Make Real Decision and Defend It

Now you know the specific failure mode. The fix might be making your system follow cross-references before answering. Or flagging any answer that depends on footnote text for human review.

Re-run your 200 cases. Watch those footnote cases specifically. Does your fix work? Good. Now you can ship.
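Watching the footnote cases specifically means reporting results per slice instead of one overall number. A minimal sketch, assuming each graded case carries a slice name; `accuracy_by_slice` and the toy results below are made up for illustration and are not real run output.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of (slice_name, passed) pairs -> {slice_name: (passed, total)}."""
    tally = defaultdict(lambda: [0, 0])
    for slice_name, passed in results:
        tally[slice_name][0] += int(passed)
        tally[slice_name][1] += 1
    return {name: (p, t) for name, (p, t) in tally.items()}

# Toy data only: six graded cases, three of which depend on a footnote.
toy_results = [
    ("footnote", True), ("footnote", False), ("footnote", True),
    ("main_body", True), ("main_body", True), ("main_body", True),
]
for name, (passed, total) in accuracy_by_slice(toy_results).items():
    print(f"{name}: {passed}/{total} passed")
```

Run the same report before and after the fix. If the footnote slice improves and the main-body slice does not get worse, you have evidence the fix did what you intended.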

And when someone asks "how do we know this is safe?" you can say: Here's our test set of real documents. Here's the failure mode we found. Here's the guardrail we built. Here's our performance on that specific problematic slice.

That's a defensible position. "Our model is 98% accurate" is not.

## Principles for Any AI System Where Wrong Is Expensive

**Eval set is deliverable, model is commodity.** You can swap models in an afternoon. Claude to GPT-4 to Llama, whatever. You cannot swap a trustworthy eval set in an afternoon. That golden set of 200 labeled cases? That's your real IP. Spend effort accordingly.

**Golden set must use real, ugly inputs.** Clean test cases show performance on clean cases. But your production system won't see clean cases. It'll see the document someone photographed with their phone in bad lighting.

**Read failures, never just score.** The aggregate number hides everything interesting. Compare 94% where the errors are scattered randomly across all document types with 94% where every failure is a footnote issue. Completely different systems. Same number.

**Reward abstention.** "I don't know, route to human" is often the correct answer for safety-critical tasks. Punishing abstention trains your model to guess confidently. That's how people get hurt.

**Set bar before seeing results.** Decide what error rate you can accept before running evaluation. Otherwise whatever number you achieve becomes the number you needed. "Oh, 94%? Yeah that's... that's probably fine." No. Decide first.
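One way to make "decide first" stick is to write the bar down as a frozen value and commit it before the first run. This is a sketch under assumptions: `ShipBar`, `meets_bar`, and the threshold values are hypothetical, and the right thresholds depend entirely on your own stakes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShipBar:
    max_overall_error_rate: float
    max_slice_error_rate: float

def meets_bar(bar: ShipBar, overall: tuple, slices: dict) -> bool:
    """`overall` and each value in `slices` are (errors, total) pairs."""
    def rate(pair):
        errors, total = pair
        return errors / total

    if rate(overall) > bar.max_overall_error_rate:
        return False
    # Every named slice must clear its own bar, not just the average.
    return all(rate(pair) <= bar.max_slice_error_rate for pair in slices.values())

# Hypothetical thresholds, written down before any evaluation run.
BAR = ShipBar(max_overall_error_rate=0.02, max_slice_error_rate=0.05)

# The first run from the worked example: 12 failures out of 200.
print(meets_bar(BAR, overall=(12, 200), slices={}))
```

Because the dataclass is frozen and the values live in version control, moving the bar after the fact takes a visible change that someone has to defend.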

## Why This Is the Whole Job

When you build AI systems for healthcare, financial services, or any domain where wrong answers hurt people, the question isn't "is it good?" The question is "which specific cases does it fail on, and can we live with that?"

The danger isn't bad models. We have good models now. The danger is teams not building the instruments to detect exactly how their models are bad.

The model is the visible part. The evaluation system decides if it's safe for people to use.

When I'm hiring for high-stakes AI deployment, I barely care about model building skills. Can you build an evaluation system that tells the truth about what the model can and can't do? That's the job.