← Back to review

The problem of getting reliable structured output...

June 1, 2026

## The Confidence Trap in Document Processing

I spent three years building document extraction at scale, and the worst bugs always came from the same place: the model giving us perfect-looking wrong answers.

You know how it goes. Your demo extracts every field from that pristine PDF. Ship it to production and suddenly you're parsing forms that look like someone faxed them through a potato. But here's what kills you: the model still returns clean JSON. Every field populated. No errors. Just wrong.

We learned this the hard way processing insurance formularies. The system would confidently tell us Drug X required prior auth when the actual document said the opposite. Or worse, when the document was too blurry to read at all. The extracted data looked perfect. The downstream systems trusted it. Patients got denied medications they should have had access to.

## What Actually Works

After burning through every approach, here's what stuck:

**Force the model to show its work.** For every extracted value, demand the exact text span from the source. Not a paraphrase, not a summary. Character-by-character what it saw. Then validate that span exists in the document.

``` { "drug_tier": { "value": "Tier 2", "source_span": "Tier 2 - Preferred Brand", "page": 47, "confidence": 0.92 } } ```

If the model can't point to where it found something, it didn't find it. Period.

**Make "unknown" a first-class value.** Your schema should explicitly allow fields to be unknown. Not null, not empty string. Unknown. Train the model that admitting ignorance is success, not failure.

**Build retry logic that knows when to quit.** Set a hard retry budget. Three attempts max. Each retry gets the previous attempt's output and error. If it still can't extract cleanly, route to a human. Count that as a win, not a failure.

We tried confidence scores, ensemble voting, all the fancy stuff. None of it beat simple source validation. The model either shows you exactly where it found the answer or it admits it couldn't find one.

## The Human Routing Reality

Here's something most teams get wrong: they treat human review as failure. We flipped that. Human review became our highest confidence path. The system saying "I need help with this one" meant we'd get accurate data, just slower.

Our metrics changed too. Instead of measuring extraction rate, we measured correctness rate. A confident wrong answer hurt way more than an honest "I don't know."

Set up your queues so the easy docs flow through automatically and the weird ones get flagged fast. One pharmacist reviewing 50 flagged documents beats an AI confidently poisoning your database with bad extractions from 5,000.

You working with medical documents specifically? The validation requirements there are brutal but the patterns hold across domains. The messier your inputs, the more you need to force the model to prove its work.