Topic for a Kevin Badinger blog post

# RAG Is a System, Not a Vector Database

Last month I watched a team blow $400K on a RAG implementation that couldn't answer basic questions about their own product docs. They had the fanciest vector database. They had embeddings from the latest models. They had cosine similarity scores that looked great on paper.

And their system was useless.

You know why? Because they built a vector database when they needed to build a system.

## The Vector Database Trap

Here's how most teams approach RAG: grab Pinecone or Weaviate, throw documents at it, wire up some embeddings, and call it done. Congratulations, you've built the AI equivalent of a keyword search that costs 100x more and works half as well.

I see three failure modes over and over:

**The "Close But Wrong" Problem**

Your user asks "What's our policy on remote work?" and your RAG retrieves three paragraphs about removing work computers from the office. Lexically similar? Sure. Semantically correct? Not even close. But your vector similarity score is 0.89 so everything must be fine, right?

**Context Window Chaos**

You retrieve 12 relevant documents. Great! Except now you've stuffed 30,000 tokens into your context window and the model can't find the actual answer buried in paragraph 47 of document 8. You turned a precision instrument into a confused undergraduate trying to write a term paper at 3am.

**The Staleness Problem Nobody Talks About**

Your vectors are three weeks old. The source documents changed yesterday. But vector databases don't come with freshness checks, so you're confidently serving outdated information. In healthcare? That's not a bug, it's a lawsuit.

## How You Actually Fix This Mess

Fixing the "close but wrong" problem isn't about better embeddings. You need hybrid retrieval. Yes, use vectors. But also use BM25 keyword matching. And metadata filters. And user context. I typically see around a 40% improvement in retrieval results just from adding keyword search back in. Turns out computers were good at finding documents before we decided everything needed to be a 768-dimensional vector.
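
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). To be clear, that's my pick for illustration, not the only option. Here's a minimal sketch, assuming each retriever hands back an ordered list of document IDs. The IDs and the `k=60` constant are placeholders, not values from a real system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so a document ranked well by more than one retriever rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for "What's our policy on remote work?"
vector_hits = ["hw-removal", "remote-policy", "travel-policy"]
keyword_hits = ["remote-policy", "hr-handbook", "hw-removal"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

The vector search alone put the wrong document first. After fusion, the document that both retrievers agree on wins.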

For context window chaos, you need two things: a reranker and a context budget. Don't just dump everything into the prompt. Score your retrievals. Order them. And then consciously decide how much context you're sending. I usually start with a 4,000 token budget and tune from there. Better to send three perfect paragraphs than twelve mediocre pages.

And freshness? You need probes. Actual monitoring. Not "the pipeline ran successfully" but "when I query for document X, do I get the version from timestamp Y?" Run these checks every hour. Log mismatches. Alert on staleness. Boring? Yes. Critical? Also yes.
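
A probe like that can be as simple as comparing hashes. This is a sketch, assuming you store a content hash alongside each indexed document and can look it up by ID. `run_freshness_probes`, `ProbeResult`, and the `index_lookup` callable are hypothetical names, not part of any vector database's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ProbeResult:
    doc_id: str
    fresh: bool
    expected_hash: str
    indexed_hash: str


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_freshness_probes(source_docs, index_lookup):
    """Check that the index serves the current version of every source doc.

    source_docs:  dict of doc_id -> current source text
    index_lookup: callable doc_id -> hash stored with the indexed copy, or None
    """
    results = []
    for doc_id, text in source_docs.items():
        expected = content_hash(text)
        indexed = index_lookup(doc_id) or ""  # missing from index counts as stale
        results.append(ProbeResult(doc_id, expected == indexed, expected, indexed))
    return results


# Hypothetical data: "dosing-guide" changed after it was indexed.
source = {"dosing-guide": "revised text", "intake-form": "unchanged text"}
indexed_hashes = {
    "dosing-guide": content_hash("original text"),
    "intake-form": content_hash("unchanged text"),
}

stale = [r.doc_id for r in run_freshness_probes(source, indexed_hashes.get) if not r.fresh]
print(stale)
```

Schedule it, log every mismatch, and page someone when the stale list isn't empty.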

## The Healthcare Wake-Up Call

But here's where it gets serious. In healthcare AI, your RAG isn't just serving answers. It's creating an audit trail. And if you can't show exactly how you got from "patient asked about drug interactions" to "model recommended stopping medication," you're done.

Every chunk needs provenance. Not just "came from document X" but "came from paragraph Y of version Z of document X, ingested at timestamp A by process B." Every prompt assembly needs to be reproducible. Every model response needs full traceability.
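
Here's roughly what that looks like as a data structure. It's a sketch with field names I made up for illustration, so adapt them to whatever your ingestion pipeline actually records.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkProvenance:
    document_id: str       # document X
    document_version: str  # version Z
    paragraph_index: int   # paragraph Y
    ingested_at: str       # timestamp A, ISO 8601 in UTC
    ingested_by: str       # process B


@dataclass(frozen=True)
class Chunk:
    text: str
    provenance: ChunkProvenance

    def citation(self) -> str:
        """Short, human-readable pointer back to the exact source location."""
        p = self.provenance
        return f"{p.document_id}@{p.document_version}#para{p.paragraph_index}"


# Hypothetical chunk; the values are placeholders.
chunk = Chunk(
    text="(chunk text goes here)",
    provenance=ChunkProvenance(
        document_id="formulary",
        document_version="v12",
        paragraph_index=4,
        ingested_at="2024-01-01T00:00:00+00:00",
        ingested_by="ingest-job-7",
    ),
)

print(chunk.citation())
print(asdict(chunk.provenance))
```

The dataclasses are frozen on purpose. Provenance that can be edited after the fact isn't provenance.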

You think I'm being paranoid? I watched a healthcare startup get shut down because they couldn't prove their AI wasn't making up drug dosages. Their embeddings were perfect. Their similarity scores were beautiful. But when the auditor asked "show me exactly which document said to prescribe 50mg instead of 5mg," they had nothing.

## What I'd Actually Build

Here's the architecture I'd ship for a regulated RAG system:

Start with a source-of-truth ingestion pipeline. Not "throw files in S3." I mean versioned documents with checksums, change tracking, and approval workflows. Every document has an owner. Every change has a timestamp. Every version has a hash.
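
A minimal sketch of that idea, assuming an in-memory registry. `DocumentRegistry` and its fields are hypothetical. A real system would back this with a database and a proper approval workflow rather than a single `approved_by` string.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    version: int
    sha256: str
    owner: str
    approved_by: str
    created_at: str


class DocumentRegistry:
    def __init__(self):
        self._history = {}  # doc_id -> list of DocumentVersion, oldest first

    def register(self, doc_id, content, owner, approved_by):
        """Record a new version only when the content hash actually changes."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        history = self._history.setdefault(doc_id, [])
        if history and history[-1].sha256 == digest:
            return history[-1]
        record = DocumentVersion(
            doc_id=doc_id,
            version=len(history) + 1,
            sha256=digest,
            owner=owner,
            approved_by=approved_by,
            created_at=datetime.now(timezone.utc).isoformat(),
        )
        history.append(record)
        return record

    def latest(self, doc_id):
        return self._history[doc_id][-1]


registry = DocumentRegistry()
v1 = registry.register("guideline", "first draft", owner="clinical-ops", approved_by="reviewer-a")
same = registry.register("guideline", "first draft", owner="clinical-ops", approved_by="reviewer-a")
v2 = registry.register("guideline", "second draft", owner="clinical-ops", approved_by="reviewer-b")
```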

Next, chunk with explicit boundaries. Don't let your chunking algorithm split a dosage from its medication or a contraindication from its context. In healthcare, chunk boundaries aren't about token counts. They're about semantic completeness. I'd rather have variable-size chunks that preserve meaning than uniform chunks that destroy it.
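
The simplest version of this treats a paragraph as the unit that must never be split. The sketch below assumes paragraphs are separated by blank lines and approximates token counts with whitespace-separated words. Both are simplifications, and real clinical documents will need smarter boundary rules.

```python
def chunk_by_paragraph(text, max_words=120):
    """Group whole paragraphs into variable-size chunks.

    Paragraphs are never cut. A paragraph longer than max_words becomes
    its own oversized chunk rather than being split mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five six\n\nseven eight nine"
print(chunk_by_paragraph(sample, max_words=6))
```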

For retrieval, go hybrid from day one. Vector search for semantic similarity. Keyword search for exact matches. Metadata filters for document types, departments, date ranges. And yes, this means you're running three different indices. That's not redundancy, it's reliability.
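
The metadata side is the least glamorous and the easiest to get right. Here's a sketch of filtering candidates before they ever reach ranking. The `ChunkMeta` fields are my assumptions about what you'd track, not a schema from any particular database.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    doc_type: str
    department: str
    effective_date: date


def filter_chunks(chunks, doc_types=None, departments=None, effective_on_or_after=None):
    """Keep only chunks that pass every filter that was supplied."""
    kept = []
    for c in chunks:
        if doc_types is not None and c.doc_type not in doc_types:
            continue
        if departments is not None and c.department not in departments:
            continue
        if effective_on_or_after is not None and c.effective_date < effective_on_or_after:
            continue
        kept.append(c)
    return kept


# Hypothetical index contents.
candidates = [
    ChunkMeta("c1", "policy", "hr", date(2023, 1, 10)),
    ChunkMeta("c2", "guideline", "clinical", date(2024, 3, 1)),
    ChunkMeta("c3", "guideline", "clinical", date(2021, 6, 15)),
]

current_guidelines = filter_chunks(
    candidates,
    doc_types={"guideline"},
    effective_on_or_after=date(2024, 1, 1),
)
```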

Add a reranker. I like Cohere's reranker but even a simple cross-encoder works. The point is don't trust your initial retrieval. Let a second model sort out what's relevant from what's just similar.
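
The shape of a reranking step is simple even if the model behind it isn't. In the sketch below, `score_fn` is where a cross-encoder or a hosted reranker would plug in. The `token_overlap` scorer is a deliberately crude stand-in so the example runs on its own. It is not how a real reranker scores relevance.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-score first-stage candidates against the query and keep the best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def token_overlap(query, text):
    """Toy relevance score: Jaccard overlap of lowercase tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


first_stage = [
    "Removing work computers from the office",
    "Remote work policy for all staff",
    "Parking rules",
]

best = rerank("remote work policy", first_stage, token_overlap, top_k=2)
print(best[0][1])
```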

Build a context assembler with an explicit budget. Decide how much context you're sending before you send it. Track token counts. Log what you included and what you cut. In healthcare, what you didn't include might matter as much as what you did.
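
A minimal sketch of a budgeted assembler, assuming the chunks arrive already ranked best-first. The default `count_tokens` just counts whitespace-separated words, which is a rough approximation. Swap in your model's real tokenizer before trusting the numbers.

```python
def assemble_context(ranked_chunks, budget_tokens=4000, count_tokens=lambda s: len(s.split())):
    """Greedily pack ranked chunks into a fixed budget.

    Returns what was included, what was dropped, and the tokens used,
    so the decision can be logged alongside the response.
    """
    included, dropped, used = [], [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget_tokens:
            included.append(chunk)
            used += cost
        else:
            # Too big for the remaining budget; keep going in case a
            # smaller, lower-ranked chunk still fits.
            dropped.append(chunk)
    return {"included": included, "dropped": dropped, "tokens_used": used}


result = assemble_context(["a b c", "d e f g", "h"], budget_tokens=4)
print(result)
```

The `dropped` list is the part people forget. Log it with everything else.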

Finally, wrap it all in comprehensive logging. Not just the answer but the full provenance chain. Query -> Retrieved Documents -> Reranking Scores -> Context Assembly -> Prompt -> Response. Every step. Every timestamp. Every decision.
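
One way to make that chain concrete is a per-request trace object that every stage writes into. This is a sketch. `RequestTrace` and the stage names are hypothetical, and in production you'd ship the JSON to durable, append-only storage rather than hold it in memory.

```python
import json
import uuid
from datetime import datetime, timezone

STAGES = ("query", "retrieval", "rerank", "context_assembly", "prompt", "response")


class RequestTrace:
    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.events = []

    def record(self, stage, payload):
        """Append one timestamped event for a known pipeline stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.events.append({
            "stage": stage,
            "at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
        })

    def missing_stages(self):
        """Stages with no recorded event, in pipeline order."""
        seen = {event["stage"] for event in self.events}
        return [s for s in STAGES if s not in seen]

    def to_json(self):
        return json.dumps({"trace_id": self.trace_id, "events": self.events})


trace = RequestTrace()
trace.record("query", {"text": "example question"})
trace.record("retrieval", {"chunk_ids": ["c2", "c7"]})
print(trace.missing_stages())
```

A response that goes out while `missing_stages()` is non-empty is a response you can't fully explain later.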

## The Parts Everyone Skips

Two things kill most RAG deployments: freshness and provenance.

Freshness seems boring until you're serving last month's clinical guidelines during this month's outbreak. You need active freshness monitoring. Synthetic queries that verify specific documents. Timestamp comparison between your vectors and your source truth. Alerts when drift exceeds your threshold.
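
The timestamp comparison can be sketched in a few lines. I'm assuming you can get a last-modified time for each source document and an indexed-at time for its vectors. The one-hour `max_lag` is a placeholder threshold, so pick yours based on how much staleness your domain can tolerate.

```python
from datetime import datetime, timedelta, timezone


def find_stale(source_modified, indexed_at, now, max_lag=timedelta(hours=1)):
    """Return doc_id -> how long the latest change has gone unindexed.

    A document is flagged when its source changed after it was last indexed
    (or it was never indexed) and that change is older than max_lag.
    """
    stale = {}
    for doc_id, modified in source_modified.items():
        indexed = indexed_at.get(doc_id)
        if indexed is not None and indexed >= modified:
            continue  # index already reflects the latest change
        lag = now - modified
        if lag > max_lag:
            stale[doc_id] = lag
    return stale


# Hypothetical timestamps.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
source_modified = {
    "guideline": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),
    "faq": datetime(2024, 4, 1, 9, 0, tzinfo=timezone.utc),
    "memo": datetime(2024, 5, 1, 11, 30, tzinfo=timezone.utc),
}
indexed_at = {
    "guideline": datetime(2024, 4, 20, 0, 0, tzinfo=timezone.utc),
    "faq": datetime(2024, 4, 2, 0, 0, tzinfo=timezone.utc),
}

alerts = find_stale(source_modified, indexed_at, now)
print(alerts)
```

Here only `guideline` trips the alert. `faq` was indexed after its last change, and `memo` is unindexed but still inside the grace period.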

Provenance feels like paperwork until an auditor asks how your AI made a specific recommendation. You need the full chain. Not just "we used RAG" but "this specific query retrieved these specific chunks from these specific document versions and assembled them in this specific order to create this specific prompt which generated this specific response."

Build these from the start. Retrofitting audit trails onto a production system is like adding foundations after you've built the house.

## Why This Actually Matters

Most teams think RAG is about the retrieval. Or the model. Or the embeddings. But RAG is about the system. The boring parts. The operational discipline. The audit trails.

You want to build AI for healthcare? Or finance? Or any domain where wrong answers have real consequences? Stop thinking about RAG as a vector database with an API. Start thinking about it as a production data system that happens to use embeddings.

Because that's what it is. And until you treat it like one, you're just building expensive demos.

Your move is simple: Look at your RAG implementation. Can you trace every answer back to its source? Can you prove your information is current? Can you show an auditor exactly how your system made each decision?

If not, you don't have a RAG system. You have a vector database with aspirations.

Fix the system. The vectors will take care of themselves.
