RAG (Retrieval-Augmented Generation) lets AI systems answer questions using your specific data rather than just training knowledge. This guide builds a production-quality RAG system from scratch.
What is RAG?
RAG combines:
- Your data (documents, knowledge base, database)
- A retrieval system (finds relevant chunks)
- An LLM (synthesizes an answer using retrieved context)
Instead of asking “What does the AI know about X?”, RAG asks “What does your documentation say about X?”
Architecture Overview
User query
↓
1. Embed query (convert to vector)
↓
2. Similarity search (find relevant doc chunks)
↓
3. Retrieve top-K chunks
↓
4. Build prompt: [context chunks] + [user query]
↓
5. LLM generates answer grounded in your data
↓
Response with citations
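The flow above can be sketched end to end with only the standard library. This is a toy illustration, not the stack built later in the guide: the bag-of-words `embed` function stands in for a real embedding model, the sample chunks are made up, and the final LLM call is left out so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Steps 1-3: embed the query, score every chunk, keep the top-K."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 4: place the retrieved chunks ahead of the user query."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up sample chunks, for illustration only
sample_chunks = [
    "Refunds are available within 30 days of purchase.",
    "Support can be reached by email at any time.",
    "Our office is closed on public holidays.",
]
top_chunks = retrieve("How do refunds work?", sample_chunks, top_k=1)
print(build_prompt("How do refunds work?", top_chunks))
```

Step 5 would send the printed prompt to an LLM. The rest of the guide replaces each toy piece with a real component.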
Step 1: Choose Your Stack
Embedding model:
- text-embedding-3-small (OpenAI) — good quality, low cost
- text-embedding-3-large (OpenAI) — higher quality, higher cost
- Sentence-transformers (open source) — free, self-hosted
Vector database:
- Pinecone — managed, production-ready, no infrastructure
- Chroma — open source, easy to start, local or server
- Qdrant — open source, strong filtering, self-hosted or cloud
- pgvector — PostgreSQL extension, a good fit if you’re already on Postgres
LLM:
- Claude Haiku — fast, cheap, good instruction following
- GPT-4o-mini — comparable alternative
- Local (Ollama + Llama) — private, no API costs
Step 2: Document Processing
Install dependencies
pip install anthropic openai chromadb langchain-community langchain-text-splitters pypdf tiktoken
Load and chunk your documents
The chunking strategy significantly affects quality:
# Older LangChain releases exposed these as langchain.text_splitter
# and langchain.document_loaders
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

def load_and_chunk_documents(path: str) -> list:
    """Load PDFs and split into chunks."""
    # Load every PDF under the directory (one Document per page)
    loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # characters per chunk
        chunk_overlap=200,  # overlap between chunks (important!)
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks
Chunking strategy notes:
- Chunk size 500-1000 characters: Good balance of context and precision
- Overlap 10-20%: Ensures context isn’t lost at boundaries
- Separator order matters: Try to split at paragraph/sentence breaks
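To make the size and overlap numbers concrete, here is a minimal sketch of plain fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries the separators in order); the hypothetical `chunk_text` helper only shows how `chunk_size` and `chunk_overlap` interact.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Each chunk repeats the last 2 characters of the previous one
for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(piece)
```

With `chunk_size=1000` and `chunk_overlap=200`, each window advances 800 characters, so a sentence that straddles a boundary usually appears whole in at least one chunk.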
Step 3: Create and Store Embeddings
import chromadb
from openai import OpenAI

# Initialize clients
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma_client = chromadb.Client()  # in-memory; data is lost when the process exits
# Use cosine distance so that "1 - distance" at query time is a cosine similarity
# (Chroma's default space is L2). get_or_create_collection also avoids an error
# when this code runs twice in the same process.
collection = chroma_client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

def embed_and_store(chunks: list):
    """Create embeddings and store in vector database."""
    texts = [chunk.page_content for chunk in chunks]
    metadatas = [chunk.metadata for chunk in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    # Create embeddings in batches (API limit)
    batch_size = 100
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)} chunks")
    # Store in ChromaDB
    collection.add(
        embeddings=all_embeddings,
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Stored {len(texts)} chunks in vector database")
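Positional IDs such as `chunk_0` are reused by every ingestion run, so a second batch of documents would clash with the first. One possible alternative, sketched here as an assumption rather than part of the original pipeline, is to derive each ID from the chunk's source and text. The `make_chunk_id` helper is hypothetical.

```python
import hashlib


def make_chunk_id(source: str, text: str) -> str:
    """Derive a stable ID from the chunk's source path and content."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"chunk_{digest[:16]}"  # 16 hex characters keeps IDs short but distinct


# Same input always yields the same ID; different content yields a different one
print(make_chunk_id("docs/policy.pdf", "Refunds are handled by support."))
print(make_chunk_id("docs/policy.pdf", "Shipping takes several days."))
```

In `embed_and_store`, this could replace the list comprehension that builds `ids`, for example by passing `chunk.metadata.get("source", "")` and `chunk.page_content` for each chunk.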
Step 4: Query and Retrieval
def retrieve_relevant_chunks(query: str, top_k: int = 5) -> list:
    """Find most relevant document chunks for a query."""
    # Embed the query with the same model used for the documents
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    # Results are nested per query; index 0 is our single query
    return [
        {
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            # 1 - distance is a cosine similarity only if the collection
            # uses cosine distance
            "relevance_score": 1 - results["distances"][0][i]
        }
        for i in range(len(results["documents"][0]))
    ]
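The `1 - distance` conversion relies on the distance being a cosine distance, defined as one minus cosine similarity. A small stdlib sketch of that relationship, with hypothetical helper names:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as assumed here: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


query_vec = [1.0, 0.0]
same_direction = [3.0, 0.0]
orthogonal = [0.0, 2.0]

# Converting the distance back recovers the similarity
print(1 - cosine_distance(query_vec, same_direction))
print(1 - cosine_distance(query_vec, orthogonal))
```

For other metrics, such as Euclidean distance, `1 - distance` can fall outside the range of a similarity score, so the conversion should match whatever metric the collection is configured with.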
Step 5: Generate Answer with Context
import anthropic

claude = anthropic.Anthropic()

def answer_with_rag(question: str, top_k: int = 5) -> dict:
    """Answer a question using RAG."""
    # 1. Retrieve relevant chunks
    relevant_chunks = retrieve_relevant_chunks(question, top_k)
    # 2. Format context
    context = "\n\n".join([
        f"[Source: {chunk['metadata'].get('source', 'Unknown')}, "
        f"Page: {chunk['metadata'].get('page', 'N/A')}]\n{chunk['text']}"
        for chunk in relevant_chunks
    ])
    # 3. Build prompt
    prompt = f"""Answer the question based on the provided context.
If the answer is not in the context, say so clearly.
Always cite which source(s) you're drawing from.
Context:
{context}
Question: {question}
Answer:"""
    # 4. Generate answer
    response = claude.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "answer": response.content[0].text,
        "sources": [c["metadata"].get("source") for c in relevant_chunks],
        "chunks_used": len(relevant_chunks)
    }
# Usage
result = answer_with_rag("What is our refund policy for digital products?")
print(result["answer"])
print("Sources:", result["sources"])
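When several retrieved chunks come from the same file, the `sources` list repeats that file. A small optional sketch for collapsing repeats while keeping retrieval order follows. The `unique_sources` helper is hypothetical and assumes chunks shaped like the dictionaries returned by `retrieve_relevant_chunks`.

```python
def unique_sources(chunks: list[dict]) -> list[str]:
    """Return each source once, in the order it first appears."""
    sources = [chunk["metadata"].get("source", "Unknown") for chunk in chunks]
    # dict.fromkeys keeps first-seen order and drops later duplicates
    return list(dict.fromkeys(sources))


# Made-up retrieval results, for illustration only
example_chunks = [
    {"text": "...", "metadata": {"source": "policy.pdf", "page": 2}},
    {"text": "...", "metadata": {"source": "faq.pdf", "page": 1}},
    {"text": "...", "metadata": {"source": "policy.pdf", "page": 3}},
]
print(unique_sources(example_chunks))
```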
Step 6: Evaluation
Test your RAG system before production:
test_questions = [
    {
        "question": "What are the refund terms?",
        "expected_topics": ["refund", "policy", "days"]
    },
    {
        "question": "How do I contact support?",
        "expected_topics": ["email", "contact", "support"]
    }
]

for test in test_questions:
    result = answer_with_rag(test["question"])
    print(f"Q: {test['question']}")
    print(f"A: {result['answer'][:200]}...")
    # Check if expected topics are mentioned
    topics_found = [t for t in test["expected_topics"]
                    if t.lower() in result["answer"].lower()]
    print(f"Topics found: {topics_found}/{test['expected_topics']}")
    print("---")
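The topic check above can be turned into a number that is easy to track between runs. This is a minimal sketch with a hypothetical `topic_coverage` helper. It uses the same substring matching as the loop, so it measures keyword presence, not correctness.

```python
def topic_coverage(answer: str, expected_topics: list[str]) -> float:
    """Fraction of expected topics that appear (case-insensitively) in the answer."""
    if not expected_topics:
        return 1.0  # nothing was expected, so nothing is missing
    answer_lower = answer.lower()
    found = [t for t in expected_topics if t.lower() in answer_lower]
    return len(found) / len(expected_topics)


def mean_coverage(results: list[tuple[str, list[str]]]) -> float:
    """Average coverage over (answer, expected_topics) pairs."""
    if not results:
        return 0.0
    return sum(topic_coverage(a, t) for a, t in results) / len(results)


# Made-up answers, for illustration only
sample = [
    ("Refunds are possible within a set number of days.", ["refund", "policy", "days"]),
    ("Contact support by email.", ["email", "contact", "support"]),
]
print(mean_coverage(sample))
```

A threshold on the mean (the exact value is a judgment call) can then gate a deployment or flag a regression after changing chunking or retrieval settings.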
Production Improvements
Hybrid search (semantic + keyword):
# Add BM25 keyword search alongside vector search
# Use RRF (Reciprocal Rank Fusion) to combine results
# Significantly improves recall for exact match queries
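A minimal sketch of the fusion step, assuming the keyword search and the vector search each return a ranked list of chunk IDs. The constant `k=60` is a commonly used default, not a value taken from this guide, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical chunk IDs from the two retrievers
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "c", "d"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

IDs that appear in both lists accumulate two contributions, so they rise above IDs found by only one retriever. Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.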
Query rewriting:
# Before searching, rewrite the user query for better retrieval
rewrite_prompt = f"""Rewrite this query for document retrieval:
"{query}"
Make it more specific and include key terms that documents would contain."""
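One way to wire the rewrite prompt into the pipeline is sketched below. The `llm` parameter is an assumption: any callable that takes a prompt string and returns the model's text, for example a thin wrapper around the `claude.messages.create` call from Step 5. Both function names are illustrative.

```python
from typing import Callable


def build_rewrite_prompt(query: str) -> str:
    """Assemble the rewrite instruction around the user's query."""
    return (
        "Rewrite this query for document retrieval:\n"
        f'"{query}"\n'
        "Make it more specific and include key terms that documents would contain."
    )


def rewrite_query(query: str, llm: Callable[[str], str]) -> str:
    """Ask the model for a retrieval-friendly query; fall back to the original."""
    rewritten = llm(build_rewrite_prompt(query)).strip().strip('"')
    return rewritten or query


def fake_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs without an API key."""
    return '"refund policy for digital products"'


print(rewrite_query("can I get my money back?", fake_llm))
```

The rewritten string would then be passed to retrieval, while the original question is kept for the final answer prompt so the model still answers what the user actually asked.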
Answer evaluation:
# Check if the answer is actually grounded in the retrieved context
# Reduces hallucination in answers
eval_prompt = f"""Does this answer only use information from the context?
Context: {context}
Answer: {answer}
Rate: grounded / partially_grounded / not_grounded"""
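The grader's reply still has to be mapped to one of the three labels. Here is a minimal parsing sketch with a hypothetical helper. The order of checks matters because "grounded" is a substring of the other two labels.

```python
import re

# Most specific labels first: "grounded" also occurs inside the other two
LABELS = ("not_grounded", "partially_grounded", "grounded")


def parse_grounding_label(reply: str) -> str:
    """Map a free-text grader reply to a label, or 'unknown' if none is found."""
    # Normalize spaces and hyphens so "not grounded" matches "not_grounded"
    text = re.sub(r"[\s\-]+", "_", reply.lower())
    for label in LABELS:
        if label in text:
            return label
    return "unknown"


print(parse_grounding_label("Rate: partially_grounded"))
print(parse_grounding_label("The answer is not grounded in the context."))
```

This is a heuristic: a reply that mentions several labels resolves to the first match in `LABELS`, so asking the grader to output only the label keeps parsing reliable.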
Deployment Options
- Prototype: ChromaDB in-memory + Claude API
- Small production: ChromaDB server or Qdrant on a single VM
- Scale: Pinecone managed + Claude API with caching
- Privacy-sensitive: Qdrant self-hosted + Ollama local models
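"Caching" can mean several things. One simple reading, assumed here, is to memoize answers for repeated questions so identical queries skip both retrieval and the LLM call. The `AnswerCache` class is a hypothetical in-process sketch. A production deployment would more likely use a shared store and an expiry policy.

```python
import hashlib
from typing import Callable


class AnswerCache:
    """Memoize answers keyed by a normalized form of the question."""

    def __init__(self, answer_fn: Callable[[str], dict]):
        self._answer_fn = answer_fn  # e.g. answer_with_rag from Step 5
        self._store: dict[str, dict] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, question: str) -> dict:
        key = self._key(question)
        if key not in self._store:
            self._store[key] = self._answer_fn(question)
        return self._store[key]


calls = []


def fake_answer(question: str) -> dict:
    """Stand-in for the real pipeline; records how often it is invoked."""
    calls.append(question)
    return {"answer": f"stub answer for: {question}"}


cache = AnswerCache(fake_answer)
cache.ask("What is the refund policy?")
cache.ask("  what is the REFUND policy? ")
print(len(calls))
```

Cached answers go stale when the underlying documents change, so the cache should be cleared whenever the collection is re-indexed.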