
TOPIC: How you build and run an engineering team t...

June 1, 2026

"Show me the eval" should be as automatic as "show me the tests."

That's the rule we live by now, but we learned it the hard way. You know how every team has that one person who owns all the weird tribal knowledge? We had Sarah. She built our entire eval framework for the AI features. Set up the benchmarks, wrote the test cases, knew which metrics mattered and which were noise. Smart as hell.

Then she got pulled into a fire drill that lasted three weeks. And our eval discipline went straight to zero.

## The Trap

You're shipping AI features and feeling good about the velocity. PRs are moving fast. Team's excited about the new prompt improvements. Nobody's asking hard questions because the demos look great.

But you're not running evals on those prompt changes. You're not testing the pipeline modifications against your benchmark datasets. You're treating the AI components like they're somehow different from the rest of your codebase.

They're not. A prompt change can break your product just as hard as a database migration. Actually harder, because at least with a bad migration you get error messages. With prompt drift, you get confident-sounding nonsense that your customers discover first.

## Worked Example

So there we were, three weeks into Sarah being unavailable. The team kept shipping. Changed the core prompt structure for our document analysis feature. Updated the temperature settings. Switched to a new model version because hey, the benchmarks OpenAI published looked better.

Nobody ran our evals. The PR reviews focused on the code changes, not the output changes. "LGTM, ship it."

Two days after deploy, customer complaints started rolling in. The document summaries were suddenly useless. Not obviously broken, just... wrong. Grabbing the wrong sections. Missing key points. Confidently summarizing things that weren't even in the documents.

We rolled back, but the damage was done. Trust eroded. Customers started double-checking every AI output, defeating the whole point of the feature.

The fix wasn't complicated. We just made eval results a required part of every PR that touched AI components. No eval, no merge. Same as unit tests.
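Here's roughly what "no eval, no merge" can look like as a CI check. This is a minimal sketch, not our actual pipeline: the path patterns, the `touches_ai` and `merge_allowed` names, and the idea of passing in an `eval_report_attached` flag are all assumptions made for illustration. Your repo layout and CI system will differ.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns marking "AI components"; adjust to your repo layout.
AI_PATTERNS = ["prompts/*", "pipelines/*", "config/model*.yaml"]


def touches_ai(changed_files):
    """Return the changed files that match any AI-component pattern."""
    return [
        path for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in AI_PATTERNS)
    ]


def merge_allowed(changed_files, eval_report_attached):
    """No eval, no merge: block PRs that touch AI files without eval results."""
    ai_files = touches_ai(changed_files)
    if ai_files and not eval_report_attached:
        return False, f"eval results required for: {', '.join(ai_files)}"
    return True, "ok"


if __name__ == "__main__":
    # A prompt change with no eval report attached should be blocked.
    print(merge_allowed(["prompts/summarize.txt", "README.md"], False))
```

The point of the design is that the check is dumb on purpose. It doesn't judge whether the eval results are good, only that they exist, which leaves the judgment to the human reviewer.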

But here's where most teams get it wrong: they think this means having an "AI expert" who reviews all the AI changes. That's backwards. You need every engineer on the team to understand evals.

## Principles Under the Example

**Treat prompts like code.** Version control them. Diff them. Review them. Test them. A prompt change is a behavior change, and behavior changes need tests.
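If your prompts live in plain files, git already diffs them for you. If they live somewhere else (a database, a config service), a few lines of standard-library Python get you the same reviewable diff. A small sketch, where `prompt_diff` and the example prompts are made up for illustration:

```python
import difflib


def prompt_diff(old, new, name="prompt"):
    """Return a unified diff between two versions of a prompt."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    before = "Summarize the document.\nBe concise."
    after = "Summarize the document in three bullets.\nBe concise."
    print(prompt_diff(before, after, name="summarize"))
```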

**Make evals visible in PRs.** Not just pass/fail. Show the actual outputs. Show how they changed. Make it easy for reviewers to spot regressions.
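One way to get outputs in front of reviewers is to render a before/after table and post it as a PR comment. The sketch below only builds the Markdown; how you post it depends on your CI. The `render_eval_comment` name and the dict keys (`id`, `before`, `after`, `passed_before`, `passed_after`) are hypothetical, not a format we're prescribing.

```python
def render_eval_comment(cases):
    """Build a Markdown table of before/after outputs for a PR comment.

    `cases` is a list of dicts with the hypothetical keys
    id, before, after, passed_before, passed_after.
    """
    def cell(text):
        # Keep each table cell on one line and escape pipe characters.
        return text.replace("|", "\\|").replace("\n", " ")

    lines = [
        "| case | before | after | status |",
        "| --- | --- | --- | --- |",
    ]
    for case in cases:
        if case["passed_before"] and not case["passed_after"]:
            status = "REGRESSION"
        elif not case["passed_before"] and case["passed_after"]:
            status = "fixed"
        elif case["before"] != case["after"]:
            status = "changed"
        else:
            status = "same"
        lines.append(
            f"| {case['id']} | {cell(case['before'])} "
            f"| {cell(case['after'])} | {status} |"
        )
    return "\n".join(lines)
```

Note the "changed" status: an output that still passes but reads differently is exactly the kind of thing a reviewer should look at with their own eyes.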

**Every engineer owns evals.** If you can't run the evals yourself, you shouldn't be changing the prompts. If you can't interpret the results, you shouldn't be reviewing the changes.

**Set up eval infrastructure early.** Before you ship your first AI feature. Test datasets, evaluation metrics, comparison tools. This is table stakes, not a nice-to-have.
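"Infrastructure" can start very small. Here's a sketch of the three pieces (dataset, metric, comparison) in about twenty lines. Everything in it is an assumption for illustration: the `key_point_recall` metric is a crude substring check, the dataset shape is invented, and the `tolerance` default is arbitrary. A real summarization eval would need a better metric, but the shape is the same.

```python
def key_point_recall(summary, key_points):
    """Fraction of expected key points found in the summary (case-insensitive)."""
    if not key_points:
        return 1.0
    text = summary.lower()
    return sum(point.lower() in text for point in key_points) / len(key_points)


def run_eval(summarize, dataset):
    """Mean metric score over the dataset; `summarize` is any callable str -> str."""
    scores = [
        key_point_recall(summarize(example["document"]), example["key_points"])
        for example in dataset
    ]
    return sum(scores) / len(scores)


def compare(baseline, candidate, dataset, tolerance=0.02):
    """Score both versions and flag a regression beyond the tolerance."""
    base_score = run_eval(baseline, dataset)
    cand_score = run_eval(candidate, dataset)
    return {
        "baseline": base_score,
        "candidate": cand_score,
        "regressed": cand_score < base_score - tolerance,
    }


if __name__ == "__main__":
    dataset = [{
        "document": "Revenue rose 10%. Churn fell.",
        "key_points": ["revenue rose", "churn fell"],
    }]
    # Stand-ins for two pipeline versions; a real one would call the model.
    baseline = lambda doc: doc
    candidate = lambda doc: doc.split(".")[0]
    print(compare(baseline, candidate, dataset))
```

Because `summarize` is just a callable, the same harness covers a prompt change, a model swap, or a temperature tweak: wrap each variant in a function and compare it against the current one.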

**Document why each eval exists.** We keep a simple markdown file: what this eval tests, why we care, what failure looks like. When an engineer sees a failing eval six months later, they need context.

## Why This Is the Whole Job

You're not building AI features. You're building software that happens to use AI components. The AI part doesn't make you special. It makes you vulnerable to new failure modes.

We hire for this now. Not looking for prompt engineering wizardry. Looking for engineers who ask "how do we know this works?" before they ask "how do we make this work?"

During interviews, I show candidates a PR with a prompt change and ask them to review it. The good ones immediately ask about tests. Where are the before/after comparisons? What's the eval coverage? How do we know this doesn't break existing functionality?

Skepticism beats enthusiasm every time when you're running production systems.

And here's something we discovered about distributed teams: async eval reviews actually work better than sync ones. You can't hand-wave through eval results in a Slack thread. You have to document what changed and why it's okay. The written trail keeps everyone honest.

We use Notion for our eval documentation. Nothing fancy. Just tables showing input/output pairs, what we expect, what we're measuring. GitHub PR comments for the eval results on each change. Slack threads for the "hey, is this output actually better or just different?" discussions.

The tooling doesn't matter. The discipline does. Every prompt change gets evaluated. Every model swap gets benchmarked. Every temperature adjustment gets tested against your real use cases.

Your AI features are only as good as your worst untested change. And unlike traditional code, where bugs usually announce themselves with errors, AI bugs hide in plain sight. They look like features working correctly, just... worse.

So make "show me the eval" your default. Make it boring. Make it routine. That's how you ship AI features that actually work when your expert is out sick or stuck in a three-week fire drill. Or when your team grows beyond the people who built the original system. Or when you're debugging a customer complaint at 2 AM and need to know if the model outputs changed or if your customer's expectations did.

The alternative is learning this lesson the way we did. With angry customers and a three-week cleanup. Your call.