Building RAG that actually works

RAG is simple in theory, brittle in practice. Here's how retrieval actually flows, what the LLM should and shouldn't do, and why your chunking strategy matters more than your model choice.

I thought RAG was straightforward. Embed your documents, throw them in a vector database, retrieve on query, generate an answer. The architecture diagrams make it look like plumbing.

Then I shipped one to production.

The retrieval worked fine on my test queries, then fell apart on real user questions. Chunks that scored high on similarity had nothing to do with what users actually needed, and the model hallucinated confidently when the right context was buried at position 7 instead of position 1. I spent more time debugging retrieval failures than I did building the initial system.

What I learned is that the boring parts matter most. How you chunk documents—respecting the structure of Q&A pairs, instruction manuals, financial statements—determines whether retrieval works. Your embedding model choice cascades through everything downstream. And despite the proliferation of eval frameworks, there’s still no reliable way to know if your RAG system actually works until real users hit it.

This post is what I wish I’d known before building my first production system.

When to use AI (and when not to)

Before adding AI to any system, ask: can this be solved deterministically?

Problem | LLM needed? | Better approach
“Find documents about X” | Maybe | Keyword search + filters first
“Summarize this document” | Yes | LLM with full context
“Answer questions about our docs” | Yes | RAG
“Route user to correct department” | Probably not | Decision tree or classifier
“Extract fields from invoice” | Depends | Regex/templates for structured formats

The rule: deterministic systems are predictable, testable, and cheap. Use AI only when the problem genuinely requires reasoning over unstructured data.

RAG is appropriate when:

  • Users ask natural language questions
  • Answers exist in your documents but aren’t directly indexable
  • The question requires synthesizing information across sources

RAG is overkill when:

  • A SQL query would work
  • Users know exactly what they’re looking for
  • Your data is already structured

What RAG actually means

RAG stands for Retrieval-Augmented Generation, and the name tells you exactly what it does: retrieve relevant context, augment the prompt with it, then generate an answer. It’s not a framework or a library—it’s a pattern. Search first, then generate.

The core insight is simple: LLMs don’t know your data, so you fetch the relevant parts and hand them over at inference time.

[Figure: RAG architecture showing the indexing, retrieval, and generation phases]

The LLM never sees your entire knowledge base. It only sees the chunks you retrieve. This is both the power and the limitation of RAG.

Anatomy of a RAG request

Let’s trace a real request through the system.

Step 1: Embed the query

func (s *RAG) Query(ctx context.Context, question string) (string, error) {
    // Convert question to vector
    embedding, err := s.embedder.Embed(ctx, question)
    if err != nil {
        return "", fmt.Errorf("embed query: %w", err)
    }

The embedding model converts the question into a vector—a list of floats that represent semantic meaning. Questions with similar meaning produce similar vectors.
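
Under the hood, “similar vectors” usually means high cosine similarity (or, for normalized vectors, a high dot product). A minimal sketch using only the standard math package:

func cosineSimilarity(a, b []float32) float64 {
    // Assumes both vectors have the same dimensionality.
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

In practice the vector database computes this for you; the sketch is just to demystify what “closest” means in the next step.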

Step 2: Retrieve relevant chunks

    // Search vector database
    chunks, err := s.vectorDB.Search(ctx, SearchRequest{
        Vector: embedding,
        TopK:   10,
        Filter: Filter{
            Namespace: "docs",
        },
    })
    if err != nil {
        return "", fmt.Errorf("search: %w", err)
    }

The vector database finds chunks whose embeddings are closest to the query embedding. This is semantic search—it finds conceptually similar content, not just keyword matches.

Step 3: (Optional) Rerank

    // Rerank for relevance
    if s.reranker != nil {
        chunks, err = s.reranker.Rerank(ctx, question, chunks)
        if err != nil {
            return "", fmt.Errorf("rerank: %w", err)
        }
        chunks = chunks[:min(5, len(chunks))]
    }

Embedding similarity isn’t perfect. A reranker (like Cohere Rerank 3.5 or open source alternatives like bge-reranker-v2.5-gemma2-lightweight [9]) scores each chunk against the actual question, improving precision. In Anthropic’s benchmarks, combining reranking with contextual embeddings and BM25 reduced retrieval failures by up to 67% [2]—though results vary by domain.
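
The reranker is just another pluggable stage. A minimal interface sketch (the method name and shape are my assumption, not any specific vendor’s API):

// Reranker reorders candidate chunks by relevance to the question, best first.
type Reranker interface {
    Rerank(ctx context.Context, question string, chunks []Chunk) ([]Chunk, error)
}

Keeping it behind an interface means you can swap rerankers, or skip reranking entirely, which is exactly what the nil check above allows.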

Step 4: Build the prompt

    // Construct prompt with context
    context := formatChunks(chunks)

    prompt := fmt.Sprintf(`Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have information about that."

Context:
%s

Question: %s

Answer:`, context, question)

This is the “augmentation” part. We inject retrieved context into the prompt, giving the LLM the information it needs to answer.
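
The formatChunks helper is worth spelling out, because it’s where source attribution starts. A sketch, assuming the Chunk type (with ID and Content fields) defined later in the chunking section:

// formatChunks renders retrieved chunks with stable identifiers so answers
// can be traced back to specific chunks when something goes wrong.
func formatChunks(chunks []Chunk) string {
    var b strings.Builder
    for i, c := range chunks {
        fmt.Fprintf(&b, "[%d] (chunk %s)\n%s\n\n", i+1, c.ID, c.Content)
    }
    return b.String()
}

If you also ask the model to cite the bracketed numbers, attribution stays deterministic: you map citations back to chunk IDs in code, not in the prompt.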

Step 5: Generate

    // Generate answer
    response, err := s.llm.Complete(ctx, CompletionRequest{
        Model:    "gpt-4o",
        Messages: []Message{{Role: "user", Content: prompt}},
    })
    if err != nil {
        return "", fmt.Errorf("generate: %w", err)
    }

    return response.Content, nil
}

The LLM synthesizes an answer from the context. If your retrieval worked, the answer is grounded in your documents. If retrieval failed, you get hallucinations.

What the LLM should do vs. what should be deterministic

This is where most RAG systems go wrong. They use the LLM for everything, when much of the pipeline should be deterministic.

The core issue: LLMs are probabilistic text generators. They’re excellent at language tasks—synthesis, rephrasing, reasoning about meaning. They’re unreliable at everything else.

Let the LLM handle:

  • Synthesizing answers from multiple chunks
  • Rephrasing information in the user’s terms
  • Reasoning about whether context answers the question
  • Admitting uncertainty when context is insufficient

Keep deterministic:

  • Query routing — Use rules or a classifier, not an LLM call (see the sketch after this list)
  • Chunk selection — Vector search + reranking, not LLM-based filtering
  • Source attribution — Track chunk IDs through the pipeline
  • Access control — Filter by permissions before retrieval, never after
  • Calculations — Do the math in code, pass results to the LLM
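
For query routing, a few explicit rules often cover the common cases before any model gets involved. A minimal sketch with hypothetical route names:

type Route int

const (
    RouteDocs Route = iota
    RoutePricing
    RouteSupport
)

// routeQuery handles the obvious cases deterministically. Anything ambiguous
// falls through to a default route (or a small trained classifier).
func routeQuery(q string) Route {
    lower := strings.ToLower(q)
    switch {
    case strings.Contains(lower, "price") || strings.Contains(lower, "plan"):
        return RoutePricing
    case strings.Contains(lower, "refund") || strings.Contains(lower, "cancel"):
        return RouteSupport
    default:
        return RouteDocs
    }
}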

Why prompt-encoded logic breaks

I learned this the hard way: encoding business rules in prompts is brittle. You’re one grammatical ambiguity away from broken behavior.

# Prompt v1 - works in testing
"If the user asks about pricing, only show plans they're eligible for"

# What happens in production:
# - "eligible" is interpreted differently across requests
# - Edge cases (trials, legacy plans, enterprise overrides) aren't handled
# - Model sometimes shows ineligible plans "for context"

The same applies to calculations. Asking the LLM to compute invoice totals, apply discounts, or calculate dates produces inconsistent results—even when it “knows” the formula. The model might round differently, misinterpret currency, or apply operations in the wrong order.

// BAD: LLM does the math
prompt := `Calculate the total for these line items with 15% discount...`
// Sometimes rounds wrong, sometimes applies discount per-item vs total

// GOOD: Code does the math, LLM explains it
total := calculateTotal(lineItems, discount) // Deterministic
prompt := fmt.Sprintf(`The total is $%.2f. Explain this to the user...`, total)

Access control is the critical case

Passing user permissions into the metadata filter is the only secure way to do access control in RAG. Not post-retrieval filtering. Not prompt instructions. Metadata filters at query time.

// GOOD: Deterministic filtering before search
func (s *RAG) Query(ctx context.Context, userID, question string) (string, error) {
    // Get user's accessible namespaces BEFORE any AI
    namespaces, err := s.permissions.GetNamespaces(ctx, userID)
    if err != nil {
        return "", err
    }

    // Embed the question as before (omitted), then search only permitted documents
    chunks, err := s.vectorDB.Search(ctx, SearchRequest{
        Vector: embedding,
        Filter: Filter{
            Namespaces: namespaces, // Deterministic filter
        },
    })
    // ...
}

// BAD: Asking the LLM to filter
prompt := `Here are some documents. Only use ones the user has access to...`
// The LLM might ignore this instruction

The principle: LLMs are unreliable gatekeepers. Any security, correctness, or numerical constraint must be enforced in code, not in prompts.

Chunking matters more than you think

The quality of your RAG is bounded by the quality of your chunks. Bad chunking = bad retrieval = bad answers.

The chunking problem

Documents don’t have natural boundaries. A PDF might have:

  • Headers that span multiple topics
  • Tables that need surrounding context
  • Code blocks that shouldn’t be split
  • References that point elsewhere

Naive chunking (split every N tokens) destroys this structure.

Chunking strategies

Fixed-size with overlap:

func ChunkFixed(text string, size, overlap int) []string {
    if size <= 0 || overlap < 0 || overlap >= size {
        return []string{text} // guard against a zero or negative step
    }
    runes := []rune(text) // index runes so a chunk never splits a UTF-8 character
    var chunks []string
    for i := 0; i < len(runes); i += size - overlap {
        end := min(i+size, len(runes))
        chunks = append(chunks, string(runes[i:end]))
        if end == len(runes) {
            break // reached the end; don't emit a redundant overlapping tail
        }
    }
    return chunks
}

Simple but loses semantic boundaries. A sentence about “the previous section” becomes meaningless.

Semantic chunking:

// ChunkSemantic splits on document structure instead of raw length.
// splitOnHeaders, splitOnParagraphs, tokenCount, and maxChunkSize are
// format-specific helpers you supply (Markdown, HTML, PDF, ...).
func ChunkSemantic(text string) []string {
    // Split on natural boundaries
    sections := splitOnHeaders(text)

    var chunks []string
    for _, section := range sections {
        if tokenCount(section) > maxChunkSize {
            // Recursively split large sections on paragraphs
            chunks = append(chunks, splitOnParagraphs(section)...)
        } else {
            chunks = append(chunks, section)
        }
    }
    return chunks
}

Preserves document structure but requires format-aware parsers—Markdown headers, HTML DOM, PDF layout analysis. This isn’t a simple string split; you need different logic for each document type.
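
To make “format-aware” concrete, here’s a Markdown-only sketch of splitOnHeaders. A real implementation would also handle code fences, setext headers, and front matter:

// splitOnHeaders breaks Markdown into sections, starting a new section at
// every ATX header line ("#", "##", ...).
func splitOnHeaders(text string) []string {
    headerRe := regexp.MustCompile(`(?m)^#{1,6}\s`)
    starts := headerRe.FindAllStringIndex(text, -1)
    if len(starts) == 0 {
        return []string{text}
    }
    var sections []string
    if starts[0][0] > 0 {
        sections = append(sections, text[:starts[0][0]]) // preamble before the first header
    }
    for i, s := range starts {
        end := len(text)
        if i+1 < len(starts) {
            end = starts[i+1][0]
        }
        sections = append(sections, text[s[0]:end])
    }
    return sections
}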

Hierarchical chunking:

type Chunk struct {
    ID       string
    Content  string
    ParentID string  // Link to larger context
    Level    int     // Document > Section > Paragraph
}

Store chunks at multiple granularities. Retrieve at paragraph level, but include section context in the prompt.
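
A sketch of the retrieval side, assuming chunks are stored in a map keyed by ID (a stand-in for whatever document store you use):

// expandToParents swaps each retrieved paragraph-level chunk for its parent
// section, deduplicating parents, so the prompt gets broader context.
func expandToParents(retrieved []Chunk, store map[string]Chunk) []Chunk {
    seen := make(map[string]bool)
    var expanded []Chunk
    for _, c := range retrieved {
        parent, ok := store[c.ParentID]
        if !ok {
            expanded = append(expanded, c) // no parent stored; keep the chunk itself
            continue
        }
        if !seen[parent.ID] {
            seen[parent.ID] = true
            expanded = append(expanded, parent)
        }
    }
    return expanded
}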

Chunk size tradeoffs

Use case | Recommended size | Overlap
Default baseline | 512 tokens | 50-100 tokens (10-20%)
Fact-based queries | 128-256 tokens | 10%
Context-heavy tasks | 512-1024 tokens | 15-20%
Technical documentation | 400-500 tokens | 10%

NVIDIA’s chunking benchmarks [7] found 512-1024 tokens to be the optimal range for most use cases, with performance degrading at 2048+ tokens. Smaller chunks for precise facts, larger for contextual understanding. There’s no universal answer—test with your actual queries.

Embeddings determine retrieval quality

Your embedding model defines the “meaning space” your search operates in.

The embedding bottleneck

If your embedding model doesn’t understand domain terminology, retrieval fails:

Query: "What's the P99 latency for the auth service?"

Chunks:
1. "Authentication response times at the 99th percentile..."  ← Should match
2. "The latency of network requests..."                       ← Partial match
3. "P99 refers to the 99th percentile of a distribution..."   ← Definitional, not answer

If the embedding model doesn't know "P99" = "99th percentile",
chunk 1 might not be the top result.

Embedding model choices

Model | Dimensions | Speed | Quality | Notes
OpenAI text-embedding-3-small | 1536 | Fast | Good | Cheap, solid baseline
OpenAI text-embedding-3-large | 3072 | Medium | Better | Higher cost, better for complex domains
Cohere embed-v4.0 | 1536 | Fast | Excellent | 128K context, multimodal support
Voyage-3.5 / voyage-3-large | 1024-2048 | Fast | Excellent | Best for code, 32K context
NV-Embed-v2 | 4096 | Medium | SOTA | Research license only
stella_en_1.5B_v5 | up to 8192 | Medium | Excellent | Commercial-friendly, flexible dimensions
BGE-M3 | 1024 | Fast | Good | Dense + sparse + ColBERT, 100+ languages

For most applications, text-embedding-3-small is fine. Upgrade if you see retrieval failures in evals. The open source landscape has matured significantly—BGE-M3 and stella models are now production-ready alternatives to commercial APIs.
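
Whatever you pick, keep the embedder behind an interface so swapping models is a configuration change, not a rewrite. A sketch consistent with the Embed call in Step 1:

// Embedder abstracts the embedding provider so models can be swapped or
// A/B tested without touching retrieval code.
type Embedder interface {
    Embed(ctx context.Context, text string) ([]float32, error)
}

One caveat: embeddings from different models live in different vector spaces, so changing models means re-embedding and re-indexing the whole corpus.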

When to fine-tune embeddings

Fine-tuning embedding models is rarely necessary but sometimes critical:

  • Domain-specific jargon that general models miss
  • Specific relevance criteria (e.g., “recent” should rank higher)
  • Cross-lingual retrieval in underrepresented languages

Most teams should exhaust other options (better chunking, hybrid search, reranking) before fine-tuning embeddings.

Hybrid search: combining vectors and keywords

Pure vector search has failure modes. So does keyword search. Combine them. This isn’t controversial—Microsoft, Databricks, Neo4j, and Anthropic all endorse hybrid search as best practice, with benchmarks showing 20% recall improvement over vector-only approaches.

func (s *RAG) HybridSearch(ctx context.Context, query string, embedding []float32) ([]Chunk, error) {
    // Vector search for semantic similarity
    vectorResults, err := s.vectorDB.Search(ctx, embedding, 20)
    if err != nil {
        return nil, err
    }

    // Keyword search for exact matches
    keywordResults, err := s.textSearch.Search(ctx, query, 20)
    if err != nil {
        return nil, err
    }

    // Reciprocal Rank Fusion
    return reciprocalRankFusion(vectorResults, keywordResults), nil // k=60 is set inside the helper
}

func reciprocalRankFusion(lists ...[]Chunk) []Chunk {
    const k = 60 // standard RRF constant; dampens the weight of lower ranks
    scores := make(map[string]float64)
    chunks := make(map[string]Chunk)

    for _, list := range lists {
        for rank, chunk := range list {
            scores[chunk.ID] += 1.0 / float64(k+rank+1) // RRF formula (rank is 0-based)
            chunks[chunk.ID] = chunk
        }
    }

    // Sort by combined score, highest first
    fused := make([]Chunk, 0, len(chunks))
    for _, chunk := range chunks {
        fused = append(fused, chunk)
    }
    sort.Slice(fused, func(i, j int) bool {
        return scores[fused[i].ID] > scores[fused[j].ID]
    })
    return fused
}

Hybrid search catches:

  • Exact terminology that vectors miss (“error code 503”)
  • Semantic matches that keywords miss (“the server is down” → “503 errors”)

Beyond naive RAG

The basic retrieve-and-generate pattern has known limitations. Two techniques worth knowing:

Contextual Retrieval

Anthropic’s Contextual Retrieval—prepending LLM-generated context to chunks before embedding—reduced retrieval failures by up to 67% in their benchmarks when combined with BM25 and reranking [2]. Your mileage will vary based on document types and query patterns, but the technique addresses a real problem: chunks like “revenue grew 3%” are meaningless without knowing which company and quarter.

// contextualizeChunk asks the LLM for a short, document-aware preamble and
// prepends it to the chunk before embedding.
func contextualizeChunk(ctx context.Context, chunk, document string) string {
    prompt := fmt.Sprintf(`Document: %s

Chunk: %s

Provide a short context (2-3 sentences) explaining what this chunk
is about and how it fits in the broader document.`, document, chunk)

    // llm here is whatever completion client you use elsewhere in the pipeline
    context, err := llm.Complete(ctx, prompt)
    if err != nil {
        return chunk // fall back to the bare chunk rather than dropping it
    }
    return context + "\n\n" + chunk
}

Cost is manageable with prompt caching (~$1 per million tokens for the contextualization pass).

GraphRAG

Standard vector RAG fails “global” questions—queries about themes, patterns, or summaries across your entire corpus. “What are the main risks mentioned across all quarterly reports?” can’t be answered by retrieving the top-5 similar chunks.

Microsoft’s GraphRAG [3] extracts knowledge graphs from documents automatically, builds community summaries at multiple levels, and uses these for comprehensive answers. The indexing cost is high, but LazyGraphRAG [4] achieves similar quality at roughly 0.1% of that indexing cost.

Use GraphRAG when users ask questions about your entire dataset, not just specific facts within it.

The uncomfortable truth about RAG evals

There’s no silver bullet for evaluating RAG systems. I’ve tried them all—Ragas, DeepEval, LangSmith, custom solutions—and every approach has fundamental limitations.

Why RAG evals are hard

The LLM-as-judge problem. Most automated eval frameworks use GPT-4 or Claude to score outputs. You’re using one black box to evaluate another. The judge model has its own biases, and when it disagrees with your RAG system, you don’t know which one is wrong.

Metrics don’t transfer. A faithfulness score that works for customer support documentation means nothing for legal contracts or medical records. Every domain has different quality thresholds, and the same score can indicate success or failure depending on context.

Synthetic test sets lie. You can generate thousands of question-answer pairs from your documents, but they won’t match your real query distribution. Users ask questions in ways you don’t anticipate, with typos, incomplete context, and assumptions about what the system knows.

Offline evals miss production failures. Your test set covers the queries you thought of. Production reveals the queries you didn’t. A system can score 0.95 on faithfulness and still fail catastrophically on the 5% of queries that matter most.

What actually works

Despite these limitations, you need some evaluation strategy. Here’s what I’ve found useful:

Retrieval metrics as a baseline:

  • Recall@K — Are the relevant chunks in the top K results?
  • MRR — How high does the first relevant chunk rank?
  • Context Precision — Are the retrieved chunks actually relevant to the query?

Generation metrics (separate from retrieval):

  • Faithfulness — Is the answer derived from the retrieved chunks, or is the model hallucinating?
  • Answer Relevance — Does the answer actually address what was asked?

This distinction matters. You can have perfect retrieval (high context precision) and still get unfaithful answers if the model ignores the context. Ragas popularized this terminology, and it’s worth adopting—it separates retrieval failures from generation failures.

These require labeled data, which means manual work. No shortcut.

type EvalExample struct {
    Question       string   `json:"question"`
    ExpectedChunks []string `json:"expected_chunks"`
    ExpectedAnswer string   `json:"expected_answer"`
}
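
Given labeled examples like this, the retrieval metrics are mechanical to compute. A minimal sketch for Recall@K and reciprocal rank (average the latter over your eval set to get MRR), assuming retrieved chunk IDs come straight from your search pipeline:

// recallAtK: fraction of expected chunks that appear in the top K results.
func recallAtK(expected, retrieved []string, k int) float64 {
    if len(expected) == 0 {
        return 1.0
    }
    top := make(map[string]bool)
    for i, id := range retrieved {
        if i >= k {
            break
        }
        top[id] = true
    }
    hits := 0
    for _, id := range expected {
        if top[id] {
            hits++
        }
    }
    return float64(hits) / float64(len(expected))
}

// reciprocalRank: 1/rank of the first relevant chunk, 0 if none was retrieved.
func reciprocalRank(expected, retrieved []string) float64 {
    relevant := make(map[string]bool)
    for _, id := range expected {
        relevant[id] = true
    }
    for i, id := range retrieved {
        if relevant[id] {
            return 1.0 / float64(i+1)
        }
    }
    return 0
}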

LLM-as-judge for scale, humans for calibration. Use automated evals to catch regressions, but regularly sample outputs for human review. The automated scores are only meaningful if they correlate with human judgment on your specific domain.

func EvalFaithfulness(ctx context.Context, answer, retrievedContext string) (float64, error) {
    prompt := `Rate whether this answer is faithful to the context.
A faithful answer only makes claims supported by the context.

Context:
%s

Answer:
%s

Score from 0.0 to 1.0, where 1.0 is perfectly faithful.
Return only the number.`

    resp, err := llm.Complete(ctx, fmt.Sprintf(prompt, retrievedContext, answer))
    if err != nil {
        return 0, fmt.Errorf("eval faithfulness: %w", err)
    }

    // The prompt asks for a bare number; parse it
    score, err := strconv.ParseFloat(strings.TrimSpace(resp), 64)
    if err != nil {
        return 0, fmt.Errorf("parse judge score %q: %w", resp, err)
    }
    return score, nil
}

Production logging is your real eval. Log every query, retrieved chunks, and response. When users complain or disengage, you have the data to understand why. This matters more than any offline metric.
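
A minimal shape for what to log on every request (field names are illustrative; use whatever your logging stack expects):

// QueryLog captures enough to replay and debug any production request.
type QueryLog struct {
    Timestamp   time.Time `json:"timestamp"`
    UserID      string    `json:"user_id"`
    Question    string    `json:"question"`
    ChunkIDs    []string  `json:"chunk_ids"`    // retrieval output, in rank order
    ChunkScores []float64 `json:"chunk_scores"` // similarity or rerank scores
    Answer      string    `json:"answer"`
    Model       string    `json:"model"`
    LatencyMS   int64     `json:"latency_ms"`
}

When a user flags a bad answer, the chunk IDs tell you immediately whether retrieval or generation failed, which is the distinction the metrics above try to capture offline.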

A/B testing when stakes are high. For critical applications, run new retrieval strategies against a holdout group. User behavior—clicks, follow-up questions, task completion—tells you more than any synthetic benchmark.

The frameworks

  • Ragas [5] — Solid baseline metrics (published at EACL 2024), good starting point
  • DeepEval — More comprehensive, self-explaining metrics
  • RAGChecker [6] (NeurIPS 2024) — Fine-grained claim-level entailment checking, precise error attribution
  • TruLens (now Snowflake) — Production monitoring with “RAG Triad” metrics
  • Arize Phoenix / LangSmith — Good for tracing and debugging, evals are secondary

Pick one and use it consistently. The framework matters less than having any systematic evaluation at all. But don’t mistake high scores for production readiness—the only real test is real users.

Common failure modes

After debugging many RAG systems, these patterns recur:

1. Retrieval returns garbage

Symptoms: LLM hallucinates or says “I don’t know” when the answer exists.

Causes:

  • Embedding model doesn’t understand domain terms
  • Chunks are too small, missing context
  • No hybrid search, missing keyword matches

Fix: Log retrieved chunks. Manually check if relevant content exists. Improve chunking/search.

2. Right chunks, wrong answer

Symptoms: Relevant chunks retrieved, but LLM ignores them.

Causes:

  • Too many chunks, burying the relevant one. This is the “Lost in the Middle” phenomenon (Liu et al., TACL 2024) [1]—LLMs exhibit a U-shaped attention curve, performing best when relevant information appears at the beginning or end of context, with 20+ percentage point degradation for middle positions. This persists even in extended-context models.
  • Prompt doesn’t emphasize context
  • Model has conflicting pretraining knowledge

Fix: Reduce chunk count. Rerank to put the most relevant chunk first. Explicitly instruct to prefer context over prior knowledge.
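
One cheap mitigation for the position effect: after reranking, interleave chunks so the strongest ones sit at the beginning and end of the context and the weakest land in the middle. A heuristic sketch (my own, not from the paper):

// reorderForAttention places higher-ranked chunks at the edges of the context
// window: 1st, 3rd, 5th... at the front, 2nd, 4th... reversed at the back.
func reorderForAttention(ranked []Chunk) []Chunk {
    var front, back []Chunk
    for i, c := range ranked {
        if i%2 == 0 {
            front = append(front, c)
        } else {
            back = append([]Chunk{c}, back...) // prepend to reverse order
        }
    }
    return append(front, back...)
}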

3. Answers lack specificity

Symptoms: Vague, generic answers instead of precise information.

Causes:

  • Chunks too high-level
  • No direct quotes or citations
  • Temperature too high

Fix: Smaller chunks. Add citation requirements to prompt. Lower temperature.

4. Inconsistent quality

Symptoms: Sometimes great, sometimes terrible. No pattern.

Causes:

  • No evals, so failures aren’t caught
  • Edge cases not covered
  • Retrieval quality varies by topic

Fix: Build eval set. Identify failure clusters. Add topic-specific handling.

Conclusion

RAG is straightforward in concept: retrieve context, then generate. The difficulty is in the details—chunking strategy, embedding choice, hybrid search, reranking, and evaluation.

The teams that succeed:

  1. Start simple and measure everything
  2. Keep retrieval deterministic, let the LLM synthesize
  3. Build eval sets before optimizing
  4. Treat RAG as an information retrieval problem, not a magic AI solution

The chunking matters. The embeddings matter. The evals matter. The model choice? Usually matters least.


References & Further Reading

Academic Papers

[1] Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Transactions of the Association for Computational Linguistics, 12, 157–173. DOI: 10.1162/tacl_a_00638

[3] From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Edge, D., et al. (2024). arxiv.org/abs/2404.16130

[5] Ragas: Automated Evaluation of Retrieval Augmented Generation. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). EACL 2024 System Demonstrations. arxiv.org/abs/2309.15217

[6] RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. Ru, D., Qiu, L., et al. (2024). NeurIPS 2024. proceedings.neurips.cc

[8] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings. Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). arxiv.org/abs/2402.03216

Industry Resources

[2] Contextual Retrieval — Anthropic (September 2024) anthropic.com/news/contextual-retrieval

[4] LazyGraphRAG: Setting a new standard for quality and cost — Microsoft Research (November 2024) microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost

[7] Finding the Best Chunking Strategy for Accurate AI Responses — NVIDIA developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses

[9] bge-reranker-v2.5-gemma2-lightweight — BAAI huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight

Hybrid Search Explained — Weaviate weaviate.io/blog/hybrid-search-explained

Chunking Strategies for RAG — Weaviate weaviate.io/blog/chunking-strategies-for-rag
