June 1, 2026
[Kevin: This is adapted from a blog post I wrote after finding a nasty production bug at a prior company. I've anonymized the specifics but kept the lessons intact.]
# When LLMs Meet Production: The Bug That Taught Me Distributed Systems
The first time I chained together a dozen LLM calls in production, I thought the hard part would be getting the prompts right. Wrong.
The hard part was watching a dozen things fail independently, retry independently, and write to the same place. You think one LLM call is a simple function? Sure. Chain a dozen of them? You've built a distributed system made of language models. And if you treat it like a longer function, you're going to have production failures that'll keep you up at night.
## The Expensive Lesson
I learned this at Develop Health, a medication-access company. We'd built what marketing called "over a dozen purpose-built large language model pipelines." The prior authorization process went something like this:
Extraction step. Benefit-check step. Policy-matching step. Drafting step. Review step.
Each step was a model call. Each could fail. Each fed the next. And when you get the orchestration wrong, you don't get errors. You get quiet wrong answers that look right. That's the terrifying part.
## The Bug That Almost Shipped
We had a multi-step intake pipeline. Step 3 would extract structured fields from a document and write them to the database. Step 4 would read that record and make a decision.
Step 3 was slow. Client-side timeouts started happening. So we added retry logic for timeouts. Standard practice, right?
Except the first attempt hadn't actually failed. It finished and wrote to the database. The response just got lost on the way back. So our retry ran the whole step again with a fresh model call.
Two writes to the same record.
And here's where LLMs bite you: the output is non-deterministic. Two extractions from the same document won't necessarily match. The second write clobbered the first.
Step 4 read the record and, every so often, made a wrong decision based on the second extraction. No trace of an error. Everything reported success. Just a small percentage of records with wrong values, and you'd never know unless you manually audited them.
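Here's a minimal sketch of that failure mode, not our actual code. Everything in it is a hypothetical stand-in: `db` is a plain dict playing the database, `extract_fields` fakes a model that answers differently on each call, and `lose_response` simulates a reply that gets dropped after the write has already landed.

```python
import itertools

# Hypothetical stand-ins: a dict plays the database, a cycling iterator
# plays a model whose output changes between calls.
db = {}
write_log = []
_outputs = itertools.cycle(["extraction A", "extraction B"])


def extract_fields(document: str) -> dict:
    """Fake model call: same document, different answer each time."""
    return {"fields": next(_outputs)}


def step3(record_id: str, document: str, lose_response: bool) -> dict:
    fields = extract_fields(document)
    db[record_id] = fields                  # the side effect lands first
    write_log.append((record_id, fields))
    if lose_response:
        # The work is done, but the caller never hears about it.
        raise TimeoutError("response lost after the write succeeded")
    return fields


def step3_with_naive_retry(record_id: str, document: str) -> dict:
    try:
        return step3(record_id, document, lose_response=True)
    except TimeoutError:
        # Bug: treats the timeout as a failure and reruns the whole step.
        return step3(record_id, document, lose_response=False)


result = step3_with_naive_retry("rec-1", "same document")
print(len(write_log))   # 2: two writes for one unit of work
print(db["rec-1"])      # {'fields': 'extraction B'}: the first write is gone
```

Nothing raises at the call site, and the caller gets a perfectly normal-looking result back. The only evidence is the write log, which a real pipeline may not even keep.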
## Three Things Were Actually Broken
**First, the retry assumed failure was real.** A client timeout doesn't mean the server failed. Our pipeline steps had side effects: they wrote to state. When you retry a half-succeeded step, you run the same work twice and get different answers. With LLMs, different answers are the norm, not the exception.
**Second, the write wasn't idempotent.** Running step 3 twice produced different records because the model is non-deterministic. The second run overwrote the first. Nothing enforced "this work, for this input, lands in exactly one place, once."
**Third, step 4 trusted step 3 silently.** It assumed that populated fields were correct and final. It had no way to know the record had been written twice, and no way to know about the upstream partial failure. Downstream components trusting unverifiable upstream state means you're acting on garbage with confidence.
## The Fix: One Writer + Idempotent Writes
The solution has two parts that work together.
**Only the orchestrator writes to durable state.** Individual steps don't write to the record anymore. They return results to the orchestrator. The orchestrator decides what lands in the durable record and when.
Your steps become pure: input in, result out, no side effects. All state management happens in one place. One writer means one place to debug when things go wrong. Multiple writers mean wrong state from interactions you'll never untangle.
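A rough sketch of the single-writer shape, under some assumptions: the `Orchestrator` class, the `_commit` method, and the toy `extract` and `decide` steps are all hypothetical names for illustration, and the in-memory dict stands in for whatever durable store you actually use.

```python
from typing import Any, Callable, Dict, List, Tuple


# Steps are pure: input in, result out. They never touch the record.
def extract(document: str) -> dict:
    return {"text_length": len(document)}   # stand-in for a model call


def decide(extracted: dict) -> dict:
    return {"needs_review": extracted["text_length"] == 0}


Step = Tuple[str, Callable[[Any], dict]]


class Orchestrator:
    """The only component allowed to write to the durable record."""

    def __init__(self, steps: List[Step]) -> None:
        self._steps = steps
        self.record: Dict[str, dict] = {}    # stand-in for durable state

    def run(self, document: str) -> Dict[str, dict]:
        value: Any = document
        for name, step in self._steps:
            value = step(value)              # step returns, does not write
            self._commit(name, value)        # the single write path
        return self.record

    def _commit(self, name: str, result: dict) -> None:
        self.record[name] = result


orchestrator = Orchestrator([("extract", extract), ("decide", decide)])
final_record = orchestrator.run("some document text")
print(final_record)
# {'extract': {'text_length': 18}, 'decide': {'needs_review': False}}
```

The point of routing everything through `_commit` is that there is exactly one function to instrument, log, or put a breakpoint in when the record looks wrong.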
**Every write is idempotent, keyed to the work, not the attempt.** Each unit of work gets a stable identifier derived deterministically from the input and the step, so every attempt at the same work produces the same key. A write becomes "upsert this result for this key," not "append a new result."
Run the step multiple times? Same durable record. Retries become safe. Double-write bugs can't happen by design.
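One way this could look, as a sketch rather than what we shipped. The names `work_key`, `ResultStore`, and `run_once` are hypothetical, the store is an in-memory dict, and hashing the canonical JSON of the input is just one reasonable way to get a stable key. I've also added one assumption of my own: before calling the model, `run_once` checks whether a result already exists for the key and reuses it, so a retry after a lost response doesn't swap in a different answer.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


def work_key(step: str, payload: dict) -> str:
    """Same step + same input -> same key, regardless of attempt number."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{step}\n{canonical}".encode("utf-8")).hexdigest()
    return f"{step}:{digest[:16]}"


class ResultStore:
    """In-memory stand-in for a table with a unique constraint on key."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def upsert(self, key: str, result: dict) -> None:
        self._rows[key] = result             # one row per key, never appended

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def __len__(self) -> int:
        return len(self._rows)


def run_once(store: ResultStore, step: str, payload: dict,
             fn: Callable[[dict], dict]) -> dict:
    key = work_key(step, payload)
    existing = store.get(key)
    if existing is not None:
        return existing                      # work already landed: reuse it
    result = fn(payload)
    store.upsert(key, result)
    return result


# A fake non-deterministic model: a second call would answer differently.
calls = []
_outputs = iter(["extraction A", "extraction B"])


def flaky_extract(payload: dict) -> dict:
    calls.append(payload)
    return {"fields": next(_outputs)}


store = ResultStore()
first = run_once(store, "extract", {"doc": "same document"}, flaky_extract)
second = run_once(store, "extract", {"doc": "same document"}, flaky_extract)
print(first == second, len(calls), len(store))   # True 1 1
```

Sorting the JSON keys matters here: without it, two dicts with the same content but different insertion order would hash to different keys and count as different work.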
And for the downstream fix: steps can't silently trust previous steps anymore. "Step 3 succeeded" has to be an explicit, checkable fact in the state. Not an assumption based on seeing some populated fields.
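To make "explicit, checkable fact" concrete, here's one possible shape. All of it is illustrative: `StepStatus`, `StepRecord`, `UpstreamNotVerified`, and `require_succeeded` are hypothetical names, and matching on a `work_key` string is an assumption about how you'd tie a result back to the input it was computed from.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class StepStatus(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class StepRecord:
    status: StepStatus
    work_key: str
    result: Optional[dict] = None


class UpstreamNotVerified(Exception):
    """Raised when a step cannot prove its upstream step finished."""


def require_succeeded(state: Dict[str, StepRecord], step: str,
                      expected_key: str) -> dict:
    record = state.get(step)
    if record is None:
        raise UpstreamNotVerified(f"{step}: no record")
    if record.status is not StepStatus.SUCCEEDED:
        raise UpstreamNotVerified(f"{step}: status is {record.status.value}")
    if record.work_key != expected_key:
        raise UpstreamNotVerified(f"{step}: result belongs to other work")
    if record.result is None:
        raise UpstreamNotVerified(f"{step}: succeeded without a result")
    return record.result


# Fields are populated, but the step never recorded success.
state = {
    "extract": StepRecord(StepStatus.PENDING, "extract:abc",
                          {"fields": "populated"}),
}
try:
    require_succeeded(state, "extract", "extract:abc")
    blocked = None
except UpstreamNotVerified as exc:
    blocked = str(exc)
print(blocked)   # extract: status is pending
```

The downstream step now fails loudly on exactly the case that used to slip through: data that is present but was never confirmed.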
## What I Tell Teams Now
A chain of LLM calls is a distributed system, not a long function. Treat it like one.
One writer to durable state. Let the orchestrator own all writes. Every write idempotent, keyed to the work itself. Remember that a timeout doesn't equal failure, especially across a client-server boundary. And no step silently trusts the previous step.
That last one matters more than you think. Confident-wrong propagates faster than errors. At least errors leave a trace.
## Why You Won't See This in Demos
These bugs don't show up in demos. A pipeline with all three bugs runs fine on clean input without timeouts. The problems emerge at scale: 10,000 runs with messy input and a flaky network. A small percentage of races and timeouts becomes a small percentage of wrong answers.
The real work isn't prompting. It's deciding who writes. Making double work harmless. Preventing unverified trust between steps. That's the difference between a demo pipeline and something you can actually ship.
I've seen too many teams learn this the hard way. They start with "let's chain some LLM calls" and end with mysterious data corruption that takes weeks to track down. Save yourself the pain. Build it right from the start.
You know that moment when you realize your clever optimization created a subtle data corruption bug? That's when you really understand distributed systems. Even when they're made of language models.