Definition
RAG (Retrieval-Augmented Generation) is an architecture pattern that combines information retrieval with LLM generation. Instead of relying solely on the model's parametric (baked-in) knowledge, RAG dynamically retrieves relevant documents from an external knowledge base at query time and injects them into the prompt before generation.
The Core Insight
LLMs know a lot, but their knowledge is frozen at the training cutoff, can be wrong, and cannot include your private data. RAG mitigates these limits by giving the model a "retrieval tool" to look things up at runtime.
```
Without RAG: Model answers from memory → may hallucinate or be outdated
With RAG:    Model answers from retrieved documents → grounded and current
```
RAG Architecture
Standard RAG Pipeline
```
1. INDEXING (offline, re-run when documents change):
   Documents → Chunking → Embedding → Vector Store
2. RETRIEVAL (online, per query):
   User Query → Query Embedding → Vector Search → Top-K Chunks
3. GENERATION (online, per query):
   System Prompt + Retrieved Chunks + User Query → LLM → Grounded Answer
```
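The three stages above can be sketched end to end in plain Python. This is a minimal sketch under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is an in-memory list, and the function names and sample documents are hypothetical. It stops at prompt assembly, since the LLM call depends on the provider.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Fixed-size chunking: windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """ASSUMPTION: toy stand-in for an embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, size=8, overlap=2):
    """Indexing: store (chunk text, vector, metadata) for every chunk."""
    index = []
    for doc_id, text in documents.items():
        for position, chunk in enumerate(chunk_words(text, size, overlap)):
            index.append({"text": chunk, "vector": embed(chunk),
                          "source": doc_id, "position": position})
    return index


def retrieve(index, query, k=2):
    """Retrieval: embed the query the same way and return the top-k chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vector"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(chunks, query):
    """Generation input: grounding instruction + retrieved chunks + question."""
    context = "\n".join(f"[Chunk {i} from {c['source']}] {c['text']}"
                        for i, c in enumerate(chunks, 1))
    return ("Answer based only on the context below. If not found, say so.\n"
            f"Context:\n{context}\n"
            f"Question: {query}\n"
            "Answer:")


# Hypothetical sample documents.
documents = {
    "handbook": "Employees accrue two vacation days per month. "
                "Unused vacation days expire in March.",
    "it_policy": "Laptops are replaced every three years. "
                 "Report lost laptops to the help desk.",
}
index = build_index(documents)
question = "When do unused vacation days expire?"
top_chunks = retrieve(index, question)
prompt = build_prompt(top_chunks, question)
print(prompt)
```

In a real system, `embed` would call an embedding model and `index` would live in a vector database; the control flow stays the same.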
Phase 1: Indexing
Document Loading
- Load source documents: PDFs, Word docs, websites, databases, code files
- Tools: LangChain loaders, LlamaIndex readers, Unstructured.io
Chunking
- Split documents into smaller pieces (chunks) that fit in the context window
- Strategies:
  - Fixed-size: every N tokens with M tokens of overlap
  - Recursive character splitting: split on paragraphs, then sentences, then words
  - Semantic chunking: split at semantic boundaries (topic shifts)
  - Document-structure aware: respect headers, sections, code blocks
Embedding
- Convert each chunk to a dense vector using an embedding model
- The vector captures the semantic meaning of the chunk
- Common models: OpenAI text-embedding-3-large, Cohere embed-v3, BGE, E5
Storage in Vector Database
- Store (chunk_text, embedding_vector, metadata) in a vector DB
- Popular vector stores: Pinecone, Weaviate, Chroma, Qdrant, pgvector, FAISS
Phase 2: Retrieval
Query Embedding
- Embed the user's question using the same embedding model used for indexing
- Query and chunk embeddings must be in the same vector space
Similarity Search
- Find the top-K chunks most similar to the query embedding
- Common metric: cosine similarity
- K is typically 3–10 chunks
Retrieval Strategies
| Strategy | Description | Best For |
|----------|-------------|---------|
| Semantic (dense) | Embedding-based similarity | Conceptual questions |
| Keyword (sparse, BM25) | Lexical term matching (TF-IDF-style scoring) | Exact term lookup |
| Hybrid | Combine dense + sparse with RRF (Reciprocal Rank Fusion) | General purpose |
| Contextual compression | Re-rank + compress retrieved chunks | Precision |
| Parent-child | Retrieve child, return parent chunk | Better coherence |
| Multi-query | Generate N query variants, retrieve for each | Recall |
| HyDE | Generate a hypothetical answer, retrieve similar to it | Complex queries |
Phase 3: Generation (Augmented)
```
System Prompt: "Answer based only on the context below. If not found, say so."
Context:
[Chunk 1: relevant passage from doc A]
[Chunk 2: relevant passage from doc B]
[Chunk 3: relevant passage from doc C]
Question: [user's query]
Answer:
```
Advanced RAG Techniques
Re-ranking
- After initial retrieval, use a cross-encoder to re-rank chunks by relevance
- Cross-encoders score the query and chunk together (slower but more accurate)
- Models: Cohere Rerank, BAAI/bge-reranker, ColBERT (a late-interaction model rather than a cross-encoder)
Query Transformation
- Query rewriting: rephrase the query for better retrieval
- HyDE: generate a hypothetical document that would answer the query, then retrieve chunks similar to it
- Multi-query: generate multiple query variants to increase recall
Self-RAG
- The model decides whether to retrieve (retrieval is not always necessary)
- After retrieval, it critiques the relevance of the retrieved documents
- More efficient for mixed queries (some need retrieval, some don't)
Agentic RAG
- An agent iteratively retrieves and reasons
- "Read this chunk → need more info → retrieve again → synthesize"
- Better for complex multi-hop questions
Related terms: Embeddings, Vector Database, Grounding, Hallucination, Context Window, Chunking, Retrieval, LLM
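Multi-query retrieval and hybrid (dense + sparse) retrieval both produce several ranked lists that have to be merged into one. A minimal sketch of Reciprocal Rank Fusion (RRF) follows, assuming each retriever returns chunk IDs in rank order. The function name, the chunk IDs, and the tie-break rule are hypothetical; `k=60` is a commonly used constant, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by chunk ID so the output
    is deterministic (an illustrative choice).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused if top_n is None else fused[:top_n]


# Hypothetical retrieval results for three rewrites of one question.
results_per_variant = [
    ["c1", "c2", "c3"],
    ["c2", "c4", "c1"],
    ["c2", "c1", "c5"],
]
fused = reciprocal_rank_fusion(results_per_variant, top_n=3)
print(fused)
```

RRF uses only ranks, not raw scores, so it can combine cosine similarities and BM25 scores without normalizing them to a common scale.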
RAG Evaluation Metrics
| Metric | Measures | Tool |
|--------|---------|------|
| Context Precision | Are retrieved chunks relevant? | RAGAS |
| Context Recall | Were all relevant chunks retrieved? | RAGAS |
| Faithfulness | Does answer match retrieved context? | RAGAS, TruLens |
| Answer Relevance | Does answer address the question? | RAGAS |
| End-to-End Accuracy | Is the final answer correct? | Human eval |
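To make the two retrieval metrics concrete, here is a simplified set-based sketch, assuming each test question comes with hand-labeled relevant chunk IDs. The function names and sample IDs are hypothetical, and this is not how RAGAS computes its scores, so the numbers are not comparable with RAGAS output.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labeled-relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # undefined without labels; reported as 0.0 in this sketch
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical labels for one test question.
retrieved = ["c1", "c7", "c3", "c9"]
relevant = ["c1", "c3", "c4"]
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(precision, recall)
```

Tracking both matters: raising K tends to improve recall while lowering precision, which connects these metrics to the choice of K during retrieval.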
RAG vs. Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|--------|-----|-------------|
| Knowledge updates | Easy (add to vector DB) | Requires retraining |
| Custom facts | Excellent | Good |
| Private data | Excellent | Possible but risky |
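| Cost | Retrieval infrastructure, longer prompts | GPU compute for training |
| Hallucination | Lower when answers are grounded in retrieved text | May be lower in-domain, but answers are not tied to sources |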
| Format/style | Limited | Strong |
Rule of thumb: RAG for knowledge, fine-tuning for behavior/style.
RAG Frameworks and Tools
| Tool | Notes |
|------|-------|
| LangChain | Full RAG pipeline, many integrations |
| LlamaIndex | Document-focused RAG, complex query engines |
| Haystack | Open-source framework for production search and RAG pipelines |
| AWS Bedrock Knowledge Bases | Managed RAG on AWS |
| Azure AI Search | Enterprise hybrid retrieval |
| RAGAS | RAG evaluation framework |
Common RAG Failure Modes
| Failure | Cause | Fix |
|---------|-------|-----|
| Missing relevant chunks | Poor chunking or embedding | Better chunking strategy, hybrid retrieval |
| Model ignores retrieved context | Weak grounding instructions | Stronger system prompt constraints |
| Retrieved wrong chunks | Query-document mismatch | Query rewriting, re-ranking |
| Retrieved text overwhelms the context window | K too large or chunks too big | Lower K, smaller chunks, contextual compression |
| Stale knowledge base | Index not updated | Regular re-indexing pipeline |
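For the stale knowledge base case, one way to keep re-indexing cheap is to re-embed only documents whose content changed. The sketch below assumes a content hash is stored alongside each indexed document; `plan_reindex` and the sample data are hypothetical, and the actual upsert and delete calls depend on the vector store in use.

```python
import hashlib


def content_hash(text):
    """SHA-256 digest of the document text, used as a change fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Decide which documents need re-embedding or removal.

    current_docs:   {doc_id: text} as the sources exist now.
    indexed_hashes: {doc_id: digest} recorded at the last indexing run.
    Returns (to_upsert, to_delete) as sorted lists of document IDs.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_upsert, to_delete


# Hypothetical state: "a" changed, "b" unchanged, "c" removed, "d" is new.
indexed = {
    "a": content_hash("old pricing"),
    "b": content_hash("faq"),
    "c": content_hash("retired page"),
}
current = {"a": "new pricing", "b": "faq", "d": "launch notes"}
to_upsert, to_delete = plan_reindex(current, indexed)
print(to_upsert, to_delete)
```

Running this plan on a schedule, then chunking and embedding only the `to_upsert` documents, is one form the regular re-indexing pipeline can take.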