Building RAG that actually works
RAG is simple in theory, brittle in practice. Here's how retrieval actually flows, what the LLM should and shouldn't do, and why your chunking strategy matters more than your model choice.
I thought RAG was straightforward. Embed your documents, throw them in a vector database, retrieve on query, generate an answer. The architecture diagrams make it look like plumbing.
Then I shipped one to production.
The retrieval worked fine on my test queries, then fell apart on real user questions. Chunks that scored high on similarity had nothing to do with what users actually needed, and the model hallucinated confidently when the right context was buried at position 7 instead of position 1. I spent more time debugging retrieval failures than I did building the initial system.
What I learned is that the boring parts matter most. How you chunk documents—respecting the structure of Q&A pairs, instruction manuals, financial statements—determines whether retrieval works. Your embedding model choice cascades through everything downstream. And despite the proliferation of eval frameworks, there’s still no reliable way to know if your RAG system actually works until real users hit it.
This post is what I wish I’d known before building my first production system.
When to use AI (and when not to)
Before adding AI to any system, ask: can this be solved deterministically?
| Problem | LLM needed? | Better approach |
|---|---|---|
| “Find documents about X” | Maybe | Keyword search + filters first |
| “Summarize this document” | Yes | LLM with full context |
| “Answer questions about our docs” | Yes | RAG |
| “Route user to correct department” | Probably not | Decision tree or classifier |
| “Extract fields from invoice” | Depends | Regex/templates for structured formats |
The rule: deterministic systems are predictable, testable, and cheap. Use AI only when the problem genuinely requires reasoning over unstructured data.
RAG is appropriate when:
- Users ask natural language questions
- Answers exist in your documents but aren’t directly indexable
- The question requires synthesizing information across sources
RAG is overkill when:
- A SQL query would work
- Users know exactly what they’re looking for
- Your data is already structured
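A minimal sketch of that rule as code: try the cheap, deterministic path first and only fall back to retrieval when the query genuinely needs it. The router, the order-ID pattern, and the lookupOrder helper here are hypothetical, but the shape is the point.

var orderIDPattern = regexp.MustCompile(`^#?\d{6,}$`) // hypothetical "user pasted an order ID" case

// Answer tries deterministic paths before falling back to RAG.
func (s *Service) Answer(ctx context.Context, query string) (string, error) {
    // Users who paste an ID know exactly what they want: do a lookup, not retrieval.
    if orderIDPattern.MatchString(strings.TrimSpace(query)) {
        return s.lookupOrder(ctx, query) // plain database query
    }
    // Natural-language question over unstructured docs: use RAG.
    return s.rag.Query(ctx, query)
}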
What RAG actually means
RAG stands for Retrieval-Augmented Generation, and the name tells you exactly what it does: retrieve relevant context, augment the prompt with it, then generate an answer. It’s not a framework or a library—it’s a pattern. Search first, then generate.
The core insight is simple: LLMs don’t know your data, so you fetch the relevant parts and hand them over at inference time.
The LLM never sees your entire knowledge base. It only sees the chunks you retrieve. This is both the power and the limitation of RAG.
Anatomy of a RAG request
Let’s trace a real request through the system.
Step 1: Embed the query
func (s *RAG) Query(ctx context.Context, question string) (string, error) {
// Convert question to vector
embedding, err := s.embedder.Embed(ctx, question)
if err != nil {
return "", fmt.Errorf("embed query: %w", err)
}
The embedding model converts the question into a vector—a list of floats that represent semantic meaning. Questions with similar meaning produce similar vectors.
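To make "similar vectors" concrete: similarity is usually measured with cosine similarity (or a dot product over normalized vectors). A minimal sketch, not something you normally write yourself, since the vector database computes it for you:

// CosineSimilarity returns a score in [-1, 1]; higher means more semantically
// similar. Assumes both vectors have the same length.
func CosineSimilarity(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}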
Step 2: Retrieve relevant chunks
// Search vector database
chunks, err := s.vectorDB.Search(ctx, SearchRequest{
Vector: embedding,
TopK: 10,
Filter: Filter{
Namespace: "docs",
},
})
if err != nil {
return "", fmt.Errorf("search: %w", err)
}
The vector database finds chunks whose embeddings are closest to the query embedding. This is semantic search—it finds conceptually similar content, not just keyword matches.
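Under the hood this is a nearest-neighbor search. A naive in-memory version looks like the sketch below (the Embedding field on Chunk is assumed here); real vector databases use approximate indexes like HNSW so they don't have to scan every chunk.

// searchBruteForce scores every chunk against the query embedding and returns
// the topK most similar. Fine for small corpora; far too slow at scale.
func searchBruteForce(query []float32, corpus []Chunk, topK int) []Chunk {
    ranked := make([]Chunk, len(corpus))
    copy(ranked, corpus)
    sort.Slice(ranked, func(i, j int) bool {
        return CosineSimilarity(query, ranked[i].Embedding) >
            CosineSimilarity(query, ranked[j].Embedding)
    })
    return ranked[:min(topK, len(ranked))]
}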
Step 3: (Optional) Rerank
// Rerank for relevance
if s.reranker != nil {
chunks, err = s.reranker.Rerank(ctx, question, chunks)
if err != nil {
return "", fmt.Errorf("rerank: %w", err)
}
chunks = chunks[:min(5, len(chunks))]
}
Embedding similarity isn’t perfect. A reranker (like Cohere Rerank 3.5 or open source alternatives like bge-reranker-v2.5-gemma2-lightweight [9]) scores each chunk against the actual question, improving precision. In Anthropic’s benchmarks, combining reranking with contextual embeddings and BM25 reduced retrieval failures by up to 67% [2]—though results vary by domain.
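In code, the reranker is just an interface; the shape below is an assumption of this sketch, not any particular vendor's SDK. Behind it typically sits a cross-encoder that reads the question and each chunk together, which is why it's more precise than embedding similarity but also slower and costlier per pair.

// Reranker reorders candidate chunks by relevance to the question,
// most relevant first.
type Reranker interface {
    Rerank(ctx context.Context, question string, candidates []Chunk) ([]Chunk, error)
}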
Step 4: Build the prompt
// Construct prompt with retrieved context
contextText := formatChunks(chunks)
prompt := fmt.Sprintf(`Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have information about that."
Context:
%s
Question: %s
Answer:`, contextText, question)
This is the “augmentation” part. We inject retrieved context into the prompt, giving the LLM the information it needs to answer.
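formatChunks is worth a moment: keeping chunk IDs visible in the prompt is what makes source attribution possible later. A sketch (the numbering and source tags are a convention I'm assuming, not a requirement):

// formatChunks renders retrieved chunks as numbered blocks, keeping each
// chunk's ID visible so answers can cite sources and be traced back later.
func formatChunks(chunks []Chunk) string {
    var b strings.Builder
    for i, c := range chunks {
        fmt.Fprintf(&b, "[%d] (source: %s)\n%s\n\n", i+1, c.ID, c.Content)
    }
    return b.String()
}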
Step 5: Generate
// Generate answer
response, err := s.llm.Complete(ctx, CompletionRequest{
Model: "gpt-4o",
Messages: []Message{{Role: "user", Content: prompt}},
})
if err != nil {
return "", fmt.Errorf("generate: %w", err)
}
return response.Content, nil
}
The LLM synthesizes an answer from the context. If your retrieval worked, the answer is grounded in your documents. If retrieval failed, you get hallucinations.
What the LLM should do vs. what should be deterministic
This is where most RAG systems go wrong. They use the LLM for everything, when much of the pipeline should be deterministic.
The core issue: LLMs are probabilistic text generators. They’re excellent at language tasks—synthesis, rephrasing, reasoning about meaning. They’re unreliable at everything else.
Let the LLM handle:
- Synthesizing answers from multiple chunks
- Rephrasing information in the user’s terms
- Reasoning about whether context answers the question
- Admitting uncertainty when context is insufficient
Keep deterministic:
- Query routing — Use rules or a classifier, not an LLM call
- Chunk selection — Vector search + reranking, not LLM-based filtering
- Source attribution — Track chunk IDs through the pipeline
- Access control — Filter by permissions before retrieval, never after
- Calculations — Do the math in code, pass results to the LLM
Why prompt-encoded logic breaks
I learned this the hard way: encoding business rules in prompts is brittle. You’re one grammatical ambiguity away from broken behavior.
# Prompt v1 - works in testing
"If the user asks about pricing, only show plans they're eligible for"
# What happens in production:
# - "eligible" is interpreted differently across requests
# - Edge cases (trials, legacy plans, enterprise overrides) aren't handled
# - Model sometimes shows ineligible plans "for context"
The same applies to calculations. Asking the LLM to compute invoice totals, apply discounts, or calculate dates produces inconsistent results—even when it “knows” the formula. The model might round differently, misinterpret currency, or apply operations in the wrong order.
// BAD: LLM does the math
prompt := `Calculate the total for these line items with 15% discount...`
// Sometimes rounds wrong, sometimes applies discount per-item vs total
// GOOD: Code does the math, LLM explains it
total := calculateTotal(lineItems, discount) // Deterministic
prompt := fmt.Sprintf(`The total is $%.2f. Explain this to the user...`, total)
Access control is the critical case
Passing user permissions into the metadata filter is the only secure way to do access control in RAG. Not post-retrieval filtering. Not prompt instructions. Metadata filters at query time.
// GOOD: Deterministic filtering before search
func (s *RAG) Query(ctx context.Context, userID, question string) (string, error) {
// Get user's accessible namespaces BEFORE any AI
namespaces, err := s.permissions.GetNamespaces(ctx, userID)
if err != nil {
return "", err
}
// Embed the query exactly as in Step 1
embedding, err := s.embedder.Embed(ctx, question)
if err != nil {
return "", err
}
// Search only permitted documents
chunks, err := s.vectorDB.Search(ctx, SearchRequest{
Vector: embedding,
Filter: Filter{
Namespaces: namespaces, // Deterministic filter
},
})
// ...
}
// BAD: Asking the LLM to filter
prompt := `Here are some documents. Only use ones the user has access to...`
// The LLM might ignore this instruction
The principle: LLMs are unreliable gatekeepers. Any security, correctness, or numerical constraint must be enforced in code, not in prompts.
Chunking matters more than you think
The quality of your RAG is bounded by the quality of your chunks. Bad chunking = bad retrieval = bad answers.
The chunking problem
Documents don’t have natural boundaries. A PDF might have:
- Headers that span multiple topics
- Tables that need surrounding context
- Code blocks that shouldn’t be split
- References that point elsewhere
Naive chunking (split every N tokens) destroys this structure.
Chunking strategies
Fixed-size with overlap:
func ChunkFixed(text string, size, overlap int) []string {
    // Splits on raw byte offsets, not tokens; assumes size > overlap.
    var chunks []string
    for i := 0; i < len(text); i += size - overlap {
        end := min(i+size, len(text))
        chunks = append(chunks, text[i:end])
    }
    return chunks
}
Simple but loses semantic boundaries. A sentence about “the previous section” becomes meaningless.
Semantic chunking:
func ChunkSemantic(text string) []string {
// Split on natural boundaries
sections := splitOnHeaders(text)
var chunks []string
for _, section := range sections {
if tokenCount(section) > maxChunkSize {
// Recursively split large sections on paragraphs
chunks = append(chunks, splitOnParagraphs(section)...)
} else {
chunks = append(chunks, section)
}
}
return chunks
}
Preserves document structure but requires format-aware parsers—Markdown headers, HTML DOM, PDF layout analysis. This isn’t a simple string split; you need different logic for each document type.
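To make "format-aware" concrete, here is what a Markdown-only splitOnHeaders might look like. It's a sketch under the assumption that headings start with "#"; a real implementation also has to skip fenced code blocks and handle other heading styles.

// splitOnHeaders splits Markdown into sections, starting a new section at each
// heading line. Headings stay attached to the body that follows them.
func splitOnHeaders(text string) []string {
    var sections []string
    var current []string
    for _, line := range strings.Split(text, "\n") {
        if strings.HasPrefix(line, "#") && len(current) > 0 {
            sections = append(sections, strings.Join(current, "\n"))
            current = nil
        }
        current = append(current, line)
    }
    if len(current) > 0 {
        sections = append(sections, strings.Join(current, "\n"))
    }
    return sections
}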
Hierarchical chunking:
type Chunk struct {
ID string
Content string
ParentID string // Link to larger context
Level int // Document > Section > Paragraph
}
Store chunks at multiple granularities. Retrieve at paragraph level, but include section context in the prompt.
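On the retrieval side, the pattern looks something like the sketch below: search against the small chunks, then swap each hit for its parent section before building the prompt. The chunkStore map is an assumption; any lookup by ID works.

// expandToParents replaces each paragraph-level hit with its parent section,
// deduplicated, so the prompt gets precise matches plus their surrounding context.
func expandToParents(hits []Chunk, chunkStore map[string]Chunk) []Chunk {
    seen := make(map[string]bool)
    var expanded []Chunk
    for _, hit := range hits {
        parent, ok := chunkStore[hit.ParentID]
        if !ok {
            expanded = append(expanded, hit) // already top-level: keep as-is
            continue
        }
        if !seen[parent.ID] {
            expanded = append(expanded, parent)
            seen[parent.ID] = true
        }
    }
    return expanded
}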
Chunk size tradeoffs
| Use case | Recommended size | Overlap |
|---|---|---|
| Default baseline | 512 tokens | 50-100 tokens (10-20%) |
| Fact-based queries | 128-256 tokens | 10% |
| Context-heavy tasks | 512-1024 tokens | 15-20% |
| Technical documentation | 400-500 tokens | 10% |
NVIDIA research points to 512-1024 tokens as the optimal range for most use cases, with performance degrading at 2048+ tokens [7]. Smaller chunks work better for precise facts, larger ones for contextual understanding. There’s no universal answer—test with your actual queries.
Embeddings determine retrieval quality
Your embedding model defines the “meaning space” your search operates in.
The embedding bottleneck
If your embedding model doesn’t understand domain terminology, retrieval fails:
Query: "What's the P99 latency for the auth service?"
Chunks:
1. "Authentication response times at the 99th percentile..." ← Should match
2. "The latency of network requests..." ← Partial match
3. "P99 refers to the 99th percentile of a distribution..." ← Definitional, not answer
If the embedding model doesn't know "P99" = "99th percentile",
chunk 1 might not be the top result.
Embedding model choices
| Model | Dimensions | Speed | Quality | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | Good | Cheap, solid baseline |
| OpenAI text-embedding-3-large | 3072 | Medium | Better | Higher cost, better for complex domains |
| Cohere embed-v4.0 | 1536 | Fast | Excellent | 128K context, multimodal support |
| Voyage-3.5 / voyage-3-large | 1024-2048 | Fast | Excellent | Best for code, 32K context |
| NV-Embed-v2 | 4096 | Medium | SOTA | Research license only |
| stella_en_1.5B_v5 | up to 8192 | Medium | Excellent | Commercial-friendly, flexible dimensions |
| BGE-M3 | 1024 | Fast | Good | Dense + sparse + ColBERT, 100+ languages |
For most applications, text-embedding-3-small is fine. Upgrade if you see retrieval failures in evals. The open source landscape has matured significantly—BGE-M3 and stella models are now production-ready alternatives to commercial APIs.
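Because the model choice cascades downstream, it helps to hide it behind a small interface so swapping is a config change rather than a rewrite. This matches how s.embedder is used earlier in this post; keep in mind that switching models means re-embedding the entire corpus, since vectors from different models aren't comparable.

// Embedder turns text into a vector. Queries and document chunks must go
// through the same implementation; changing it requires re-indexing all chunks.
type Embedder interface {
    Embed(ctx context.Context, text string) ([]float32, error)
}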
When to fine-tune embeddings
Fine-tuning embedding models is rarely necessary but sometimes critical:
- Domain-specific jargon that general models miss
- Specific relevance criteria (e.g., “recent” should rank higher)
- Cross-lingual retrieval in underrepresented languages
Most teams should exhaust other options (better chunking, hybrid search, reranking) before fine-tuning embeddings.
Hybrid search: combining vectors and keywords
Pure vector search has failure modes. So does keyword search. Combine them. This isn’t controversial—Microsoft, Databricks, Neo4j, and Anthropic all endorse hybrid search as best practice, with benchmarks showing 20% recall improvement over vector-only approaches.
func (s *RAG) HybridSearch(ctx context.Context, query string, embedding []float32) ([]Chunk, error) {
    // Vector search for semantic similarity
    vectorResults, err := s.vectorDB.Search(ctx, SearchRequest{Vector: embedding, TopK: 20})
    if err != nil {
        return nil, err
    }
    // Keyword search for exact matches
    keywordResults, err := s.textSearch.Search(ctx, query, 20)
    if err != nil {
        return nil, err
    }
    // Combine both rankings with Reciprocal Rank Fusion
    return reciprocalRankFusion(vectorResults, keywordResults), nil
}
func reciprocalRankFusion(lists ...[]Chunk) []Chunk {
    const k = 60 // standard RRF constant; dampens the influence of any single list
    scores := make(map[string]float64)
    chunks := make(map[string]Chunk)
    for _, list := range lists {
        for rank, chunk := range list {
            scores[chunk.ID] += 1.0 / float64(k+rank+1) // RRF: 1/(k + rank), with 1-based rank
            chunks[chunk.ID] = chunk
        }
    }
    // Sort chunk IDs by combined score, descending
    ids := make([]string, 0, len(scores))
    for id := range scores {
        ids = append(ids, id)
    }
    sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
    fused := make([]Chunk, 0, len(ids))
    for _, id := range ids {
        fused = append(fused, chunks[id])
    }
    return fused
}
Hybrid search catches:
- Exact terminology that vectors miss (“error code 503”)
- Semantic matches that keywords miss (“the server is down” → “503 errors”)
Beyond naive RAG
The basic retrieve-and-generate pattern has known limitations. Two techniques worth knowing:
Contextual Retrieval
Anthropic’s Contextual Retrieval—prepending LLM-generated context to chunks before embedding—reduced retrieval failures by up to 67% in their benchmarks when combined with BM25 and reranking [2]. Your mileage will vary based on document types and query patterns, but the technique addresses a real problem: chunks like “revenue grew 3%” are meaningless without knowing which company and quarter.
func contextualizeChunk(ctx context.Context, chunk, document string) string {
    prompt := fmt.Sprintf(`Document: %s
Chunk: %s
Provide a short context (2-3 sentences) explaining what this chunk
is about and how it fits in the broader document.`, document, chunk)

    chunkContext, err := llm.Complete(ctx, prompt)
    if err != nil {
        return chunk // fall back to the bare chunk if contextualization fails
    }
    return chunkContext + "\n\n" + chunk
}
Cost is manageable with prompt caching (~$1 per million tokens for the contextualization pass).
GraphRAG
Standard vector RAG fails “global” questions—queries about themes, patterns, or summaries across your entire corpus. “What are the main risks mentioned across all quarterly reports?” can’t be answered by retrieving the top-5 similar chunks.
Microsoft’s GraphRAG [3] extracts knowledge graphs from documents automatically, builds community summaries at multiple levels, and uses these for comprehensive answers. The indexing cost is high, but LazyGraphRAG (November 2024) achieves similar quality at roughly 0.1% of GraphRAG’s indexing cost [4].
Use GraphRAG when users ask questions about your entire dataset, not just specific facts within it.
The uncomfortable truth about RAG evals
There’s no silver bullet for evaluating RAG systems. I’ve tried them all—Ragas, DeepEval, LangSmith, custom solutions—and every approach has fundamental limitations.
Why RAG evals are hard
The LLM-as-judge problem. Most automated eval frameworks use GPT-4 or Claude to score outputs. You’re using one black box to evaluate another. The judge model has its own biases, and when it disagrees with your RAG system, you don’t know which one is wrong.
Metrics don’t transfer. A faithfulness score that works for customer support documentation means nothing for legal contracts or medical records. Every domain has different quality thresholds, and the same score can indicate success or failure depending on context.
Synthetic test sets lie. You can generate thousands of question-answer pairs from your documents, but they won’t match your real query distribution. Users ask questions in ways you don’t anticipate, with typos, incomplete context, and assumptions about what the system knows.
Offline evals miss production failures. Your test set covers the queries you thought of. Production reveals the queries you didn’t. A system can score 0.95 on faithfulness and still fail catastrophically on the 5% of queries that matter most.
What actually works
Despite these limitations, you need some evaluation strategy. Here’s what I’ve found useful:
Retrieval metrics as a baseline:
- Recall@K — Are the relevant chunks in the top K results?
- MRR — How high does the first relevant chunk rank?
- Context Precision — Are the retrieved chunks actually relevant to the query?
Generation metrics (separate from retrieval):
- Faithfulness — Is the answer derived from the retrieved chunks, or is the model hallucinating?
- Answer Relevance — Does the answer actually address what was asked?
This distinction matters. You can have perfect retrieval (high context precision) and still get unfaithful answers if the model ignores the context. Ragas popularized this terminology, and it’s worth adopting—it separates retrieval failures from generation failures.
These require labeled data, which means manual work. No shortcut.
type EvalExample struct {
Question string `json:"question"`
ExpectedChunks []string `json:"expected_chunks"`
ExpectedAnswer string `json:"expected_answer"`
}
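With a labeled set like this, the retrieval metrics themselves are a few lines of deterministic code. A sketch, assuming you log the ranked chunk IDs your pipeline returned for each question:

// RecallAtK: fraction of expected chunks found in the top-K retrieved IDs.
func RecallAtK(expected, retrieved []string, k int) float64 {
    if len(expected) == 0 {
        return 0
    }
    topK := retrieved[:min(k, len(retrieved))]
    found := 0
    for _, want := range expected {
        for _, got := range topK {
            if want == got {
                found++
                break
            }
        }
    }
    return float64(found) / float64(len(expected))
}

// MRR: reciprocal rank of the first relevant chunk, 0 if none was retrieved.
func MRR(expected, retrieved []string) float64 {
    for rank, got := range retrieved {
        for _, want := range expected {
            if got == want {
                return 1.0 / float64(rank+1)
            }
        }
    }
    return 0
}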
LLM-as-judge for scale, humans for calibration. Use automated evals to catch regressions, but regularly sample outputs for human review. The automated scores are only meaningful if they correlate with human judgment on your specific domain.
func EvalFaithfulness(ctx context.Context, answer, contextText string) (float64, error) {
    prompt := `Rate whether this answer is faithful to the context.
A faithful answer only makes claims supported by the context.
Context:
%s
Answer:
%s
Score from 0.0 to 1.0, where 1.0 is perfectly faithful.
Return only the number.`

    resp, err := llm.Complete(ctx, fmt.Sprintf(prompt, contextText, answer))
    if err != nil {
        return 0, err
    }
    // Parse the numeric score from the judge's response
    score, err := strconv.ParseFloat(strings.TrimSpace(resp), 64)
    if err != nil {
        return 0, fmt.Errorf("parse judge score %q: %w", resp, err)
    }
    return score, nil
}
Production logging is your real eval. Log every query, retrieved chunks, and response. When users complain or disengage, you have the data to understand why. This matters more than any offline metric.
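A minimal sketch of the record worth writing per request; the field names here are my own, the point is to capture enough to replay and diagnose any complaint after the fact.

// QueryLog captures everything needed to replay and debug one RAG request.
type QueryLog struct {
    Timestamp    time.Time `json:"timestamp"`
    UserID       string    `json:"user_id"`
    Question     string    `json:"question"`
    RetrievedIDs []string  `json:"retrieved_chunk_ids"` // in ranked order
    RerankedIDs  []string  `json:"reranked_chunk_ids,omitempty"`
    Prompt       string    `json:"prompt"`
    Answer       string    `json:"answer"`
    Model        string    `json:"model"`
    LatencyMS    int64     `json:"latency_ms"`
}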
A/B testing when stakes are high. For critical applications, run new retrieval strategies against a holdout group. User behavior—clicks, follow-up questions, task completion—tells you more than any synthetic benchmark.
The frameworks
- Ragas [5] — Solid baseline metrics (published at EACL 2024), good starting point
- DeepEval — More comprehensive, self-explaining metrics
- RAGChecker [6] (NeurIPS 2024) — Fine-grained claim-level entailment checking, precise error attribution
- TruLens (now Snowflake) — Production monitoring with “RAG Triad” metrics
- Arize Phoenix / LangSmith — Good for tracing and debugging, evals are secondary
Pick one and use it consistently. The framework matters less than having any systematic evaluation at all. But don’t mistake high scores for production readiness—the only real test is real users.
Common failure modes
After debugging many RAG systems, these patterns recur:
1. Retrieval returns garbage
Symptoms: LLM hallucinates or says “I don’t know” when the answer exists.
Causes:
- Embedding model doesn’t understand domain terms
- Chunks are too small, missing context
- No hybrid search, missing keyword matches
Fix: Log retrieved chunks. Manually check if relevant content exists. Improve chunking/search.
2. Right chunks, wrong answer
Symptoms: Relevant chunks retrieved, but LLM ignores them.
Causes:
- Too many chunks, burying the relevant one. This is the “Lost in the Middle” phenomenon (Liu et al., TACL 2024) [1]—LLMs exhibit a U-shaped attention curve, performing best when relevant information appears at the beginning or end of the context, with 20+ percentage point degradation for middle positions. This persists even in extended-context models.
- Prompt doesn’t emphasize context
- Model has conflicting pretraining knowledge
Fix: Reduce chunk count. Rerank to put the most relevant chunk first. Explicitly instruct to prefer context over prior knowledge.
3. Answers lack specificity
Symptoms: Vague, generic answers instead of precise information.
Causes:
- Chunks too high-level
- No direct quotes or citations
- Temperature too high
Fix: Smaller chunks. Add citation requirements to prompt. Lower temperature.
4. Inconsistent quality
Symptoms: Sometimes great, sometimes terrible. No pattern.
Causes:
- No evals, so failures aren’t caught
- Edge cases not covered
- Retrieval quality varies by topic
Fix: Build eval set. Identify failure clusters. Add topic-specific handling.
Conclusion
RAG is straightforward in concept: retrieve context, then generate. The difficulty is in the details—chunking strategy, embedding choice, hybrid search, reranking, and evaluation.
The teams that succeed:
- Start simple and measure everything
- Keep retrieval deterministic, let the LLM synthesize
- Build eval sets before optimizing
- Treat RAG as an information retrieval problem, not a magic AI solution
The chunking matters. The embeddings matter. The evals matter. The model choice? Usually matters least.
References & Further Reading
Academic Papers
[1] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. DOI: 10.1162/tacl_a_00638
[3] Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arxiv.org/abs/2404.16130
[5] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). Ragas: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 System Demonstrations. arxiv.org/abs/2309.15217
[6] Ru, D., Qiu, L., et al. (2024). RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. NeurIPS 2024. proceedings.neurips.cc
[8] Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings. arxiv.org/abs/2402.03216
Industry Resources
[2] Contextual Retrieval — Anthropic (September 2024) anthropic.com/news/contextual-retrieval
[4] LazyGraphRAG: Setting a new standard for quality and cost — Microsoft Research (November 2024) microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost
[7] Finding the Best Chunking Strategy for Accurate AI Responses — NVIDIA developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses
[9] bge-reranker-v2.5-gemma2-lightweight — BAAI huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight
Hybrid Search Explained — Weaviate weaviate.io/blog/hybrid-search-explained
Chunking Strategies for RAG — Weaviate weaviate.io/blog/chunking-strategies-for-rag
Documentation
- Ragas Documentation — RAG evaluation framework
- Azure AI Search: Hybrid Search — Microsoft’s RRF implementation