Topic for a Kevin Badinger blog post on LLM evalua...

# Evals: The Production Discipline Most LLM Teams Skip

A while back I watched a team ship an LLM feature that worked great in demos. Two weeks later, their model provider pushed a minor update. The feature started hallucinating customer names in support tickets. Nobody caught it for three days.

You know what would have caught it? A basic eval suite. But they didn't have one. Most teams don't.

## Evals Are Not Unit Tests

People keep trying to write unit tests for AI. That's not what evals are. Unit tests verify predictable outputs. Evals measure three things:

**Capability** - Does the model do the thing you built it for? Not in theory. In production. With real messy inputs.

**Drift** - Has its behavior changed between model versions? This is the killer. Providers update hosted models on their own schedule, and an unpinned model alias can start pointing at a newer snapshot without any change on your side. Your perfectly-tuned prompt from January might be garbage by March.

**Safety** - Does it refuse what it should refuse? Does it allow what it should allow? In healthcare contexts, this is the difference between a useful tool and a lawsuit.

I've seen teams treat evals like they're optional. Like they're nice-to-have after you ship. That's backwards. The eval suite IS part of the system.

## Why Teams Skip Them

Three reasons teams skip evals, and they're all bad reasons:

First, evals don't ship features. Product managers don't get excited about test coverage for AI. They want the chatbot live yesterday. So teams cut corners on the boring stuff that keeps the chatbot from going rogue.

Second, they take time to build. A decent eval suite needs real thought. You can't just generate test cases. You need to understand your actual failure modes. That means sitting down and thinking through how your system breaks.

Third, the first version always feels wrong. You'll build an eval suite, run it, and realize you're measuring the wrong things. So you'll rebuild it. Then rebuild it again. This feels like waste. It's not. It's learning what actually matters.

## The Minimum Eval Surface

If I'm greenlighting an LLM workflow for production, here's the bare minimum I need to see:

**Golden Set** - About 50 input/output pairs for the core task. Real examples, not synthetic. Pull them from actual usage if you can. If you're building a SQL generator, these are real questions users asked and the SQL that actually worked.

**Adversarial Set** - About 10 deliberately broken inputs. Malformed JSON. Injection attempts. Edge cases you know will happen. The goal isn't to make the model fail gracefully (though that's nice). The goal is to know HOW it fails.

**Drift Comparison** - Run your golden set against the previous model version. Same inputs, compare outputs. This is your canary. When outputs start changing, you need to know immediately, not when customers complain.

That's it. 60 test cases and a comparison script. You can build this in an afternoon.
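Here's roughly what that comparison script can look like. This is a minimal sketch, not our actual script: it assumes you've already collected outputs from both model versions into dicts keyed by case ID, and it uses plain string similarity from `difflib` as the drift signal. The function name `drift_report` and the 0.9 threshold are illustrative choices, and string similarity is a crude proxy; for SQL or structured output you'd likely compare something more semantic.

```python
import difflib

def drift_report(baseline: dict[str, str], candidate: dict[str, str],
                 threshold: float = 0.9) -> list[dict]:
    """Return golden-set cases whose new output diverged from the old one.

    baseline / candidate map case ID -> model output (assumed shape).
    threshold is an arbitrary starting point; tune it for your task.
    """
    drifted = []
    for case_id, old_output in baseline.items():
        # A case missing from the candidate run counts as fully drifted.
        new_output = candidate.get(case_id, "")
        similarity = difflib.SequenceMatcher(None, old_output, new_output).ratio()
        if similarity < threshold:
            drifted.append({"id": case_id, "similarity": round(similarity, 3)})
    return drifted

# Toy data standing in for two real eval runs.
baseline = {
    "q1": "SELECT id FROM users",
    "q2": "SELECT count(*) FROM orders",
}
candidate = {
    "q1": "SELECT id FROM users",
    "q2": "SELECT COALESCE(count(*), 0) FROM orders WHERE 1=1",
}
report = drift_report(baseline, candidate)
```

With the toy data, only `q2` lands in the report. The point isn't the similarity metric. The point is that the diff runs on every model change without a human remembering to do it.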

## Keep the Framework Boring

Every team I've seen tries to build an eval platform. Don't. You'll spend six months on infrastructure and zero months on actual evals.

Use what you have. JSON files for test cases. Your existing test runner. A CSV for results. Maybe a simple dashboard if you're feeling fancy.

I store our evals in a folder called `evals/`. Each test is a JSON file with input, expected output, and tags. We run them with pytest. Results go to a CSV. Total complexity: about 200 lines of Python.
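To make that concrete, here's a sketch of the shape, not our actual code. It assumes each file in `evals/` looks like `{"input": ..., "expected": ..., "tags": [...]}`, uses exact string match as the pass criterion (too strict for most real tasks), and takes the model as a plain callable. The names `load_cases`, `run_cases`, and `assert_all_passed` are made up for illustration.

```python
import csv
import json
from pathlib import Path

def load_cases(directory: Path) -> list[dict]:
    """Read every *.json file in the directory as one eval case."""
    cases = []
    for path in sorted(directory.glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        case["id"] = path.stem  # file name doubles as the case ID
        cases.append(case)
    return cases

def run_cases(cases: list[dict], model, results_path: Path) -> list[dict]:
    """Run each case through `model` (any callable str -> str) and write a CSV."""
    rows = []
    for case in cases:
        output = model(case["input"])
        rows.append({
            "id": case["id"],
            "tags": ";".join(case.get("tags", [])),
            # Exact match is an assumption; swap in your own comparison.
            "passed": output.strip() == case["expected"].strip(),
        })
    with results_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "tags", "passed"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

def assert_all_passed(rows: list[dict]) -> None:
    """Fail loudly, listing the IDs of every failing case."""
    failed = [row["id"] for row in rows if not row["passed"]]
    assert not failed, f"failing eval cases: {failed}"
```

A pytest test function then only has to call `assert_all_passed(run_cases(load_cases(Path("evals")), your_model, Path("results.csv")))`, where `your_model` wraps whatever client you already use.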

The boring choice is the right choice. You want to spend time writing test cases, not building test infrastructure.

## What Evals Catch That Humans Don't

Humans are terrible at noticing slow changes. You won't notice that the model is 5% worse at handling edge cases. You won't notice that it's slightly more verbose this week than last week.

But evals notice. That's their job.

Last month, one of our evals caught that GPT-4 had started being more conservative about SQL generation. It wasn't wrong, exactly. It just started wrapping everything in unnecessary COALESCE calls. Performance tanked. Our monitoring didn't catch it because the queries still worked. But our evals showed a 30% increase in query complexity overnight.

Without evals, that would have been a slow bleed. A gradual degradation that nobody quite noticed until some VP asked why the dashboards were loading slower.
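"Query complexity" can be measured a lot of ways. As one illustration (an assumption for this post, not a description of our exact metric), you can count identifier and keyword tokens plus opening parentheses, and separately count `COALESCE(` occurrences. The function names here are hypothetical, and a real SQL parser would do a better job than a regex.

```python
import re

def complexity_score(sql: str) -> int:
    """Crude proxy: number of word tokens plus opening parentheses."""
    return len(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\(", sql))

def coalesce_count(sql: str) -> int:
    """Count COALESCE( calls, case-insensitively."""
    return len(re.findall(r"\bCOALESCE\s*\(", sql, flags=re.IGNORECASE))

def mean_complexity_change(old_queries: list[str], new_queries: list[str]) -> float:
    """Relative change in mean complexity between two eval runs."""
    old_mean = sum(map(complexity_score, old_queries)) / len(old_queries)
    new_mean = sum(map(complexity_score, new_queries)) / len(new_queries)
    return (new_mean - old_mean) / old_mean

# Toy example: the same question answered before and after a model update.
old_run = ["SELECT name FROM users"]
new_run = ["SELECT COALESCE(name, '') FROM users"]
change = mean_complexity_change(old_run, new_run)
```

A metric like this doesn't tell you the query is wrong. It tells you the output changed shape, which is exactly the thing uptime monitoring can't see.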

## Healthcare: When Evals Become Audit Trails

In healthcare AI, evals aren't just good practice. They're your audit trail.

When a regulator asks "How do you know your model isn't hallucinating medication names?", you can't just say "We tested it." You need to show exactly how you tested it. With what data. How often. What the results were.

Your eval suite becomes part of your compliance documentation. Every run is logged. Every drift is documented. Every model update triggers a full eval run before deployment.

I've seen teams try to bolt on compliance after the fact. It's painful. Build it into your eval system from day one. Log everything. Version everything. Treat eval results as production data.
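"Log everything, version everything" can start as small as one append-only JSON Lines file. The sketch below is an assumption about what a useful record contains, not a compliance standard: a UTC timestamp, the model version string you deployed against, a SHA-256 fingerprint of the exact test cases, and the per-case results. The field names and the `append_audit_record` helper are illustrative. Check with your own compliance people about what your regulator actually expects.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def dataset_fingerprint(cases: list[dict]) -> str:
    """Hash the test cases so a record proves exactly which data was used."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def append_audit_record(log_path, model_version: str,
                        cases: list[dict], results: list[dict]) -> dict:
    """Append one eval run to a JSON Lines log and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "dataset_sha256": dataset_fingerprint(cases),
        "total": len(results),
        "passed": sum(1 for r in results if r["passed"]),
        "results": results,
    }
    # Append mode: earlier runs are never rewritten.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

One line per run means the log diffs cleanly, greps cleanly, and can be shipped to wherever your production logs already go.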

In one healthcare project, we caught drift where the model had started substituting generic drug names for brand names. Technically correct, but it confused patients who knew their meds by brand. Our pharmacist consultant spotted it in the eval review. That's the kind of subtle failure that only domain experts catch, but evals make visible.

## Growing Your Eval Suite

Your eval suite isn't static. Every production issue generates new test cases. Every customer complaint becomes an eval. Every edge case you didn't think of gets added to the suite.

We add about 5 new test cases per week. Some weeks more. Some weeks none. But the suite grows with the product.
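The lower the friction, the more cases get added. A sketch of a helper for that, assuming one JSON file per case with `input`, `expected`, and `tags` fields: it writes the new case and tags it `regression` so you can later see how much of the suite came from real incidents. The function name and the tag are hypothetical conventions.

```python
import json
from pathlib import Path

def add_regression_case(evals_dir, case_id: str, input_text: str,
                        expected: str, tags: list[str]) -> Path:
    """Write a new eval case file; refuse to overwrite an existing one."""
    path = Path(evals_dir) / f"{case_id}.json"
    if path.exists():
        raise FileExistsError(f"eval case already exists: {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    case = {
        "input": input_text,
        "expected": expected,
        # Always mark incident-derived cases so they can be filtered later.
        "tags": sorted(set(tags) | {"regression"}),
    }
    path.write_text(json.dumps(case, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return path
```

Refusing to overwrite is deliberate. An existing case is history, and history shouldn't change silently.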

Track eval results over time. Graph them. You'll start to see patterns. Maybe the model gets worse at a specific task every time there's an update. Maybe certain types of inputs are always brittle.
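You don't need a dashboard to spot those patterns. A sketch, assuming each result row is a dict with a `tags` list and a `passed` boolean: compute the pass rate per tag for two runs and report the tags that dropped. The 5-point `min_drop` default is an arbitrary starting value, and the function names are illustrative.

```python
from collections import defaultdict

def pass_rate_by_tag(rows: list[dict]) -> dict[str, float]:
    """Map each tag to the fraction of its cases that passed."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for row in rows:
        for tag in row["tags"]:
            counts[tag][0] += int(row["passed"])
            counts[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in counts.items()}

def regressions(previous: list[dict], current: list[dict],
                min_drop: float = 0.05) -> dict[str, float]:
    """Return tags whose pass rate fell by at least min_drop between runs."""
    prev = pass_rate_by_tag(previous)
    curr = pass_rate_by_tag(current)
    return {
        tag: round(prev[tag] - curr[tag], 3)
        for tag in prev
        if tag in curr and prev[tag] - curr[tag] >= min_drop
    }

# Toy runs: "joins" cases got worse, "dates" cases stayed the same.
previous_run = [
    {"tags": ["joins"], "passed": True},
    {"tags": ["joins"], "passed": True},
    {"tags": ["dates"], "passed": True},
    {"tags": ["dates"], "passed": False},
]
current_run = [
    {"tags": ["joins"], "passed": True},
    {"tags": ["joins"], "passed": False},
    {"tags": ["dates"], "passed": True},
    {"tags": ["dates"], "passed": False},
]
dropped = regressions(previous_run, current_run)
```

This is why tags matter. An overall pass rate hides the fact that one category fell off a cliff while everything else held steady.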

One pattern we've noticed: models get worse at saying "I don't know" over time. They get more confident with each update. More willing to hallucinate. Our evals track this explicitly now.
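One way to track it, sketched under some assumptions: keep a set of questions the model should not be able to answer, run them on every model version, and measure how often the response abstains. The phrase list below is a made-up starting point, and substring matching will miss plenty of phrasings, so treat the resulting number as a trend line rather than a precise measurement.

```python
# Hypothetical marker phrases; extend these from your own transcripts.
ABSTAIN_MARKERS = (
    "i don't know",
    "i do not know",
    "i'm not sure",
    "cannot determine",
)

def is_abstention(output: str) -> bool:
    """True if the output contains any known abstention phrase."""
    # Normalize curly apostrophes so "I’m" matches "I'm".
    text = output.lower().replace("\u2019", "'")
    return any(marker in text for marker in ABSTAIN_MARKERS)

def abstention_rate(outputs: list[str]) -> float:
    """Fraction of outputs that abstain; 0.0 for an empty run."""
    if not outputs:
        return 0.0
    return sum(is_abstention(o) for o in outputs) / len(outputs)

# Toy outputs from a set of deliberately unanswerable questions.
outputs = [
    "I don't know.",
    "The answer is 42",
    "I\u2019m not sure about that",
    "Paris",
]
rate = abstention_rate(outputs)
```

On a set of unanswerable questions, you want this rate high and stable. If it falls after a model update, the model has started guessing.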

The eval suite becomes institutional memory. It remembers all the ways your system has failed before. It checks for them every time you deploy.

Start small. 10 test cases are better than zero. 50 is better than 10. But start today. That LLM feature you're shipping next week? It needs evals. Not later. Now.

Because in three weeks, when the model provider pushes an update that breaks your carefully crafted prompts, you'll want to know immediately. Not when customers start complaining. Not when your support queue explodes.

You'll know because your evals told you. And you'll fix it before anyone notices.

That's the difference between shipping AI features and operating AI systems. The features are easy. The operations require discipline. Evals are where that discipline lives.