RAG (Retrieval-Augmented Generation)

Definition

RAG (Retrieval-Augmented Generation) is an architecture pattern that combines information retrieval with LLM generation. Instead of relying solely on the model's parametric (baked-in) knowledge, RAG dynamically retrieves relevant documents from an external knowledge base at query time and injects them into the prompt before generation.

The Core Insight

LLMs know a lot, but their knowledge is frozen at the training cutoff, can be wrong, and cannot include your private data. RAG addresses this by retrieving relevant documents at runtime and placing them in the prompt, so the model can look things up instead of answering from memory alone.

Without RAG: Model answers from memory → may hallucinate or give outdated answers

With RAG: Model answers from retrieved documents → grounded in sources, as current as the index

RAG Architecture

Standard RAG Pipeline

1. INDEXING (offline, re-run whenever source documents change):

Documents → Chunking → Embedding → Vector Store

2. RETRIEVAL (online, per query):

User Query → Query Embedding → Vector Search → Top-K Chunks

3. GENERATION (online, per query):

System Prompt + Retrieved Chunks + User Query → LLM → Grounded Answer

Phase 1: Indexing

Document Loading

Load source documents: PDFs, Word docs, websites, databases, code files
Tools: LangChain loaders, LlamaIndex readers, Unstructured.io

Chunking

Split documents into smaller pieces (chunks) that fit the embedding model's input limit and leave room in the LLM's context window for several retrieved chunks
Strategies:

- Fixed-size: every N tokens with M token overlap

- Recursive character splitting: split on paragraphs > sentences > words

- Semantic chunking: split at semantic boundaries (topic shifts)

- Document-structure aware: respect headers, sections, code blocks
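
The fixed-size strategy above can be sketched as a sliding window over a token list. This is a minimal illustration, not a production splitter: the function name `chunk_fixed` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_fixed(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.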

Embedding

Convert each chunk to a dense vector using an embedding model
The vector captures the semantic meaning of the chunk
Common models: OpenAI text-embedding-3-large, Cohere embed-v3, BGE, E5

Storage in Vector Database

Store (chunk_text, embedding_vector, metadata) in a vector DB
Popular vector DBs: Pinecone, Weaviate, Chroma, Qdrant, pgvector, FAISS
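
To make the (chunk_text, embedding_vector, metadata) record concrete, here is a small in-memory sketch. Everything in it is an assumption for illustration: `toy_embed` is a word-hashing stand-in for a real embedding model, and `InMemoryStore` is a hypothetical class, not the API of any vector DB listed above.

```python
import hashlib
import math
from dataclasses import dataclass, field


def toy_embed(text, dim=64):
    """Toy stand-in for an embedding model: hash words into buckets, then L2-normalise."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


@dataclass
class Record:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical store that keeps records in a plain list."""

    def __init__(self):
        self.records = []

    def add(self, text, metadata=None):
        self.records.append(Record(text, toy_embed(text), metadata or {}))


store = InMemoryStore()
store.add("RAG retrieves documents at query time", {"source": "doc_a"})
store.add("Fine-tuning changes model weights", {"source": "doc_b"})
print(len(store.records), store.records[0].metadata)
```

Keeping the metadata next to the vector is what later allows filtering by source or citing the originating document in the answer.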

Phase 2: Retrieval

Query Embedding

Embed the user's question using the same embedding model used for indexing
Query and chunk embeddings must be in the same vector space

Similarity Search

Find the top-K chunks most similar to the query embedding
Most common metric: cosine similarity (dot product and Euclidean distance are also used)
K is typically 3–10 chunks
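
As a rough sketch of the search step, the following brute-force version scores every chunk vector against the query and keeps the best K. The helper names `cosine` and `top_k` are illustrative; real vector DBs typically use approximate nearest-neighbour indexes rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [(i, score) for score, i in scored[:k]]


# Tiny 2-dimensional vectors, chosen only to make the ranking easy to follow.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], chunk_vecs, k=2)
print(hits)
```

The query points almost along the first axis, so chunk 0 ranks first and the diagonal chunk 2 second.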

Retrieval Strategies

| Strategy | Description | Best For |
|----------|-------------|---------|
| Semantic (dense) | Embedding-based similarity | Conceptual questions |
| Keyword (sparse, BM25) | Term-frequency scoring (TF-IDF family) | Exact term lookup |
| Hybrid | Combine dense + sparse results, e.g. with Reciprocal Rank Fusion (RRF) | General purpose |
| Contextual compression | Re-rank + compress retrieved chunks | Precision |
| Parent-child | Match on small child chunks, return the enclosing parent chunk | Better coherence |
| Multi-query | Generate N query variants, retrieve for each | Recall |
| HyDE | Generate a hypothetical answer, retrieve chunks similar to it | Complex queries |
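
The hybrid row relies on Reciprocal Rank Fusion, which can be sketched in a few lines: each ranked list contributes 1 / (k + rank) to a chunk's score. The function name `rrf_fuse` and the chunk ids are illustrative; k = 60 is a commonly used constant, taken here as an assumption rather than a tuned value.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids by summing 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by id.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c1", "c2", "c3"]   # hypothetical ranking from embedding search
sparse = ["c3", "c1", "c4"]  # hypothetical ranking from BM25
fused = rrf_fuse([dense, sparse])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that dense and sparse scores live on different scales.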

Phase 3: Generation (Augmented)

System Prompt: "Answer based only on the context below. If not found, say so."

Context:

[Chunk 1: relevant passage from doc A]

[Chunk 2: relevant passage from doc B]

[Chunk 3: relevant passage from doc C]

Question: [user's query]

Answer:
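
Assembling the template above is plain string formatting. A minimal sketch, assuming the retrieved chunks arrive as a list of strings; `build_prompt` is a hypothetical helper, not part of any framework.

```python
def build_prompt(chunks, question):
    """Fill the grounding template with numbered chunks and the user's question."""
    context = "\n\n".join(
        f"[Chunk {i}: {text}]" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer based only on the context below. If not found, say so.\n\n"
        f"Context:\n\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    ["RAG retrieves documents at query time."],
    "What does RAG do?",
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each statement.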

Advanced RAG Techniques

Re-ranking

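After initial retrieval, use a cross-encoder to re-rank chunks by relevance
Cross-encoders score the query and chunk together in one forward pass (slower than comparing precomputed embeddings, but more accurate)
Models: Cohere Rerank, BAAI/bge-reranker; ColBERT (a late-interaction model rather than a cross-encoder) is also used for re-ranking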
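
The re-ranking step has a simple shape: score each (query, chunk) pair jointly, then keep the best few. In the sketch below, `overlap_score` is a toy word-overlap function standing in for a real cross-encoder, and both function names are hypothetical.

```python
def rerank(query, chunks, score_pair, top_n=3):
    """Sort chunks by a pairwise relevance score and keep the top_n."""
    ranked = sorted(chunks, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_n]


def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [
    "vector databases store embeddings",
    "rag retrieves documents before generation",
    "fine-tuning updates model weights",
]
best = rerank("how rag retrieves documents", candidates, overlap_score, top_n=1)
print(best)
```

In practice `score_pair` would call a re-ranking model; passing it in as a parameter keeps the pipeline code independent of which model is used.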

Query Transformation

Query rewriting: rephrase the query for better retrieval
HyDE (Hypothetical Document Embeddings): generate a hypothetical document that answers the query, embed it, then retrieve chunks similar to it
Multi-query: generate multiple query variants to increase recall
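
Multi-query retrieval can be sketched as running the retriever once per variant and merging the results without duplicates. The query variants would normally come from an LLM; here they are hard-coded, and `FAKE_INDEX` and `fake_retrieve` are hypothetical stand-ins for a real retriever.

```python
def multi_query_retrieve(variants, retrieve, k=3):
    """Retrieve for each query variant and merge ids, keeping first-seen order."""
    seen = set()
    merged = []
    for query in variants:
        for chunk_id in retrieve(query, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Hypothetical retriever backed by a lookup table instead of a vector search.
FAKE_INDEX = {
    "what is rag": ["c1", "c2"],
    "define retrieval-augmented generation": ["c2", "c3"],
}


def fake_retrieve(query, k):
    return FAKE_INDEX.get(query, [])[:k]


merged = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve)
print(merged)
```

The second variant surfaces `c3`, which the original phrasing alone would have missed; that is the recall gain the technique aims for.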

Self-RAG

The model decides whether to retrieve at all (retrieval is not always necessary)
After retrieval, it critiques the relevance of the retrieved documents and whether its answer is supported by them
More efficient for mixed workloads (some queries need retrieval, some don't)
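
The control flow can be illustrated with ordinary functions. This is only a sketch of the decision structure: in Self-RAG the decisions come from the trained model itself, whereas `needs_retrieval`, `is_relevant`, and the other callables below are hypothetical rule-based stand-ins.

```python
def self_rag_answer(query, needs_retrieval, retrieve, is_relevant, generate):
    """Skip retrieval when it is not needed; otherwise keep only relevant chunks."""
    if not needs_retrieval(query):
        return generate(query, [])
    kept = [chunk for chunk in retrieve(query) if is_relevant(query, chunk)]
    return generate(query, kept)


# Hypothetical stand-ins for decisions a trained model would make.
def needs_retrieval(query):
    return "policy" in query


def retrieve(query):
    return ["Refund policy: 30 days.", "Office hours: 9 to 5."]


def is_relevant(query, chunk):
    return "refund" in chunk.lower()


def generate(query, context):
    return f"{len(context)} chunks used"


print(self_rag_answer("what is 2 + 2", needs_retrieval, retrieve, is_relevant, generate))
print(self_rag_answer("what is the refund policy", needs_retrieval, retrieve, is_relevant, generate))
```

The arithmetic question bypasses retrieval entirely, while the policy question retrieves two chunks and discards the irrelevant one before generation.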

Agentic RAG

Agent iteratively retrieves and reasons
"Read this chunk → need more info → retrieve again → synthesize"
Better for complex multi-hop questions
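
The retrieve-and-reason loop might look like the sketch below. The `decide` callable stands in for the LLM's reasoning step and is scripted here; `KB`, `kb_retrieve`, and `scripted_decide` are hypothetical, and `max_steps` is an assumed safeguard against endless loops.

```python
def agentic_rag(question, retrieve, decide, max_steps=3):
    """Alternate retrieval and reasoning until decide() answers or steps run out.

    decide(question, evidence) returns ("answer", text) or ("search", new_query).
    """
    evidence = []
    query = question
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, evidence
        query = payload  # follow-up query for the next hop
    return None, evidence  # gave up within the step budget


# Hypothetical two-hop knowledge base and scripted decision function.
KB = {
    "where was the founder of acme born": ["Acme was founded by Ada."],
    "where was ada born": ["Ada was born in Paris."],
}


def kb_retrieve(query):
    return KB.get(query, [])


def scripted_decide(question, evidence):
    if "Ada was born in Paris." in evidence:
        return "answer", "Paris"
    return "search", "where was ada born"


answer, evidence = agentic_rag(
    "where was the founder of acme born", kb_retrieve, scripted_decide
)
print(answer, evidence)
```

The first hop only reveals who the founder is; the second hop, driven by the follow-up query, supplies the fact needed to answer.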

RAG Evaluation Metrics

| Metric | Measures | Tool |
|--------|---------|------|
| Context Precision | Are the retrieved chunks relevant? | RAGAS |
| Context Recall | Were all relevant chunks retrieved? | RAGAS |
| Faithfulness | Is every claim in the answer supported by the retrieved context? | RAGAS, TruLens |
| Answer Relevance | Does the answer address the question? | RAGAS |
| End-to-End Accuracy | Is the final answer correct? | Human eval |
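
The two retrieval metrics can be illustrated with a simplified set-based version, assuming a hand-labelled set of relevant chunk ids per question. This is not how RAGAS computes its metrics; it only shows the underlying precision/recall idea, and the function names are hypothetical.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunk ids that are labelled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of labelled-relevant chunk ids that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & set(relevant)) / len(relevant)


retrieved = ["c1", "c2", "c5"]       # what the retriever returned
relevant = {"c1", "c2", "c3", "c4"}  # hand-labelled ground truth (assumed)
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(round(precision, 3), recall)
```

Two of three retrieved chunks are relevant, and two of the four relevant chunks were found, so raising K would likely help recall at some cost to precision.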

RAG vs. Fine-Tuning

| Aspect | RAG | Fine-Tuning |
|--------|-----|-------------|
| Knowledge updates | Easy (add to vector DB) | Requires retraining |
| Custom facts | Excellent | Good |
| Private data | Excellent | Possible but risky |
| Cost | Retrieval infra, longer prompts | GPU compute for training |
| Hallucination | Lower (grounded in retrieved text) | Can be lower in-domain, but not guaranteed |
| Format/style | Limited | Strong |

Rule of thumb: RAG for knowledge, fine-tuning for behavior/style.

RAG Frameworks and Tools

| Tool | Notes |
|------|-------|
| LangChain | Full RAG pipeline, many integrations |
| LlamaIndex | Document-focused RAG, complex query engines |
| Haystack | Enterprise RAG |
| AWS Bedrock Knowledge Bases | Managed RAG on AWS |
| Azure AI Search | Enterprise hybrid retrieval |
| RAGAS | RAG evaluation framework |

Common RAG Failure Modes

| Failure | Cause | Fix |
|---------|-------|-----|
| Missing relevant chunks | Poor chunking or embedding | Better chunking strategy, hybrid retrieval |
| Model ignores retrieved context | Weak grounding instructions | Stronger system prompt constraints |
| Retrieved wrong chunks | Query-document mismatch | Query rewriting, re-ranking |
| Long chunks overwhelm context | K too large or chunks too big | Smaller chunks, contextual compression |
| Stale knowledge base | Index not updated | Regular re-indexing pipeline |

Related Concepts

Embeddings, Vector Database, Grounding, Hallucination, Context Window, Chunking, Retrieval, LLM