RAG (Retrieval-Augmented Generation) lets AI systems answer questions using your specific data rather than just training knowledge. This guide builds a production-quality RAG system from scratch.

What is RAG?

RAG combines:

  1. Your data (documents, knowledge base, database)
  2. A retrieval system (finds relevant chunks)
  3. An LLM (synthesizes an answer using retrieved context)

Instead of asking “What does the AI know about X?”, RAG answers “What does your documentation say about X?”


Architecture Overview

User query

1. Embed query (convert to vector)

2. Similarity search (find relevant doc chunks)

3. Retrieve top-K chunks

4. Build prompt: [context chunks] + [user query]

5. LLM generates answer grounded in your data

Response with citations
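
The flow above can be sketched end to end without any external services. This toy version is only an illustration: it swaps the embedding model for a bag-of-words count vector and stops before the LLM call. The names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Steps 1-3: embed the query, score every chunk, keep the top-K."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 4: context chunks first, then the user query."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up example chunks, for illustration only
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Support is reachable by email on weekdays.",
    "Our office is closed on public holidays.",
]
question = "How many days do I have to request a refund?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)  # step 5 would send this prompt to the LLM
```

A real system replaces `embed` with a learned embedding model and the linear scan with a vector database, which is what the steps below set up.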

Step 1: Choose Your Stack

Embedding model:

  • text-embedding-3-small (OpenAI) — good quality, low cost
  • text-embedding-3-large (OpenAI) — higher quality, higher cost
  • Sentence-transformers (open source) — free, self-hosted

Vector database:

  • Pinecone — managed, production-ready, no infrastructure
  • Chroma — open source, easy to start, local or server
  • Qdrant — open source, strong filtering, self-hosted or cloud
  • pgvector — PostgreSQL extension, a natural fit if you’re already on Postgres

LLM:

  • Claude Haiku — fast, cheap, good instruction following
  • GPT-4o-mini — comparable alternative
  • Local (Ollama + Llama) — private, no API costs

Step 2: Document Processing

Install dependencies

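pip install anthropic openai chromadb langchain langchain-community langchain-text-splitters pypdf tiktoken

Load and chunk your documents

Chunking strategy has a large effect on retrieval quality:

# Recent LangChain releases ship the splitters and loaders in separate packages
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader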

def load_and_chunk_documents(path: str) -> list:
    """Load PDFs and split into chunks."""
    
    # Load documents
    loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    
    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,      # characters per chunk
        chunk_overlap=200,    # overlap between chunks (important!)
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks

Chunking strategy notes:

  • Chunk size 500-1000 characters: Good balance of context and precision
  • Overlap 10-20% of chunk size: Ensures context isn’t lost at boundaries
  • Separator order matters: The splitter tries separators in order, so list paragraph breaks before sentence breaks
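
To make the effect of overlap concrete, here is a minimal fixed-window chunker. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: it ignores separators and cuts purely by character count. `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, overlap=2)
# pieces == ["abcdefghij", "ijklmnopqr", "qrstuvwxyz"]
```

Each piece starts with the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.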

Step 3: Create and Store Embeddings

import chromadb
from openai import OpenAI

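# Initialize clients
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma_client = chromadb.Client()  # in-memory; data is lost when the process exits
# Use cosine distance so that 1 - distance is a cosine similarity (Chroma defaults to squared L2)
collection = chroma_client.get_or_create_collection(
    "documents", metadata={"hnsw:space": "cosine"}
)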

def embed_and_store(chunks: list):
    """Create embeddings and store in vector database."""
    
    texts = [chunk.page_content for chunk in chunks]
    metadatas = [chunk.metadata for chunk in chunks]
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    
    # Create embeddings in batches (API limit)
    batch_size = 100
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)} chunks")
    
    # Store in ChromaDB
    collection.add(
        embeddings=all_embeddings,
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )
    
    print(f"Stored {len(texts)} chunks in vector database")
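
One thing to watch in the function above: IDs of the form `chunk_{i}` restart at zero on every call, so ingesting a second batch of documents would collide with the first. A possible workaround, sketched here as an assumption rather than part of the original design, is to derive IDs from the chunk's source, page, and text. `stable_chunk_id` is a hypothetical helper.

```python
import hashlib

def stable_chunk_id(source: str, page: int, text: str) -> str:
    """Derive a deterministic ID from where the chunk came from and what it says."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8")).hexdigest()
    return f"chunk_{digest[:16]}"  # 16 hex characters keeps IDs short but still unlikely to collide

# Hypothetical file name and text, for illustration only
id_a = stable_chunk_id("policies.pdf", 3, "Refunds are available within 30 days.")
id_b = stable_chunk_id("policies.pdf", 3, "Refunds are available within 30 days.")
id_c = stable_chunk_id("policies.pdf", 4, "Refunds are available within 30 days.")
```

With deterministic IDs, re-running ingestion on unchanged documents produces the same IDs, which pairs naturally with ChromaDB's `collection.upsert(...)` in place of `collection.add(...)`.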

Step 4: Query and Retrieval

def retrieve_relevant_chunks(query: str, top_k: int = 5) -> list:
    """Find most relevant document chunks for a query."""
    
    # Embed the query
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    return [
        {
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            # 1 - distance is a similarity only when the collection uses cosine distance
            "relevance_score": 1 - results["distances"][0][i]
        }
        for i in range(len(results["documents"][0]))
    ]
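
The `1 - distance` conversion deserves a closer look, because it only yields a meaningful similarity for cosine distance. The sketch below works through the arithmetic in plain Python. The function names are illustrative; the relevant fact about ChromaDB is that collections use squared L2 distance unless configured otherwise.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance is defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance; unbounded above."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction, different length
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
l2_score = 1 - squared_l2([1.0, 0.0], [0.0, 3.0])     # the same conversion applied to squared L2
```

With cosine distance, `1 - distance` recovers the cosine similarity. With squared L2 the same expression can go far below zero, so the resulting "relevance score" is hard to interpret or threshold.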

Step 5: Generate Answer with Context

import anthropic

claude = anthropic.Anthropic()

def answer_with_rag(question: str, top_k: int = 5) -> dict:
    """Answer a question using RAG."""
    
    # 1. Retrieve relevant chunks
    relevant_chunks = retrieve_relevant_chunks(question, top_k)
    
    # 2. Format context
    context = "\n\n".join([
        f"[Source: {chunk['metadata'].get('source', 'Unknown')}, "
        f"Page: {chunk['metadata'].get('page', 'N/A')}]\n{chunk['text']}"
        for chunk in relevant_chunks
    ])
    
    # 3. Build prompt
    prompt = f"""Answer the question based on the provided context. 
If the answer is not in the context, say so clearly.
Always cite which source(s) you're drawing from.

Context:
{context}

Question: {question}

Answer:"""
    
    # 4. Generate answer
    response = claude.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return {
        "answer": response.content[0].text,
        "sources": [c["metadata"].get("source") for c in relevant_chunks],
        "chunks_used": len(relevant_chunks)
    }

# Usage
result = answer_with_rag("What is our refund policy for digital products?")
print(result["answer"])
print("Sources:", result["sources"])
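
Because several retrieved chunks often come from the same file, the `sources` list can contain repeats. A small, optional cleanup step is sketched below, assuming chunks shaped like the dictionaries returned by `retrieve_relevant_chunks`. `unique_sources` is a hypothetical helper.

```python
def unique_sources(chunks: list[dict]) -> list[str]:
    """Return each source once, in the order it first appears."""
    sources = [c["metadata"].get("source", "Unknown") for c in chunks]
    return list(dict.fromkeys(sources))  # dicts preserve insertion order

# Made-up chunks for illustration
example_chunks = [
    {"text": "...", "metadata": {"source": "refunds.pdf", "page": 1}},
    {"text": "...", "metadata": {"source": "faq.pdf", "page": 4}},
    {"text": "...", "metadata": {"source": "refunds.pdf", "page": 2}},
    {"text": "...", "metadata": {}},
]
deduped = unique_sources(example_chunks)
```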

Step 6: Evaluation

Test your RAG system before production:

test_questions = [
    {
        "question": "What are the refund terms?",
        "expected_topics": ["refund", "policy", "days"]
    },
    {
        "question": "How do I contact support?",
        "expected_topics": ["email", "contact", "support"]
    }
]

for test in test_questions:
    result = answer_with_rag(test["question"])
    print(f"Q: {test['question']}")
    print(f"A: {result['answer'][:200]}...")
    
    # Check if expected topics are mentioned
    topics_found = [t for t in test["expected_topics"]
                    if t.lower() in result["answer"].lower()]
    print(f"Topics found: {len(topics_found)}/{len(test['expected_topics'])} {topics_found}")
    print("---")
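
Keyword checks on the final answer mix retrieval errors with generation errors. One common complement, sketched here as an assumption and not as part of the original test loop, is to score retrieval on its own: for each test question, check whether the source you expect shows up in the top-K results. `hit_rate_at_k` and the file names are hypothetical.

```python
def hit_rate_at_k(retrieved: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Fraction of queries whose expected source appears in the top-k retrieved sources."""
    if not retrieved:
        return 0.0
    hits = sum(1 for sources, want in zip(retrieved, expected) if want in sources[:k])
    return hits / len(retrieved)

# One ranked list of source names per test question (made-up data)
retrieved_sources = [
    ["refunds.pdf", "faq.pdf", "terms.pdf"],
    ["faq.pdf", "terms.pdf", "contact.pdf"],
]
expected_sources = ["refunds.pdf", "contact.pdf"]

score_at_3 = hit_rate_at_k(retrieved_sources, expected_sources, k=3)
score_at_1 = hit_rate_at_k(retrieved_sources, expected_sources, k=1)
```

A hit rate that climbs sharply as K grows suggests the right chunks are being found but ranked too low, which points toward reranking or hybrid search as the next fix.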

Production Improvements

Hybrid search (semantic + keyword):

# Add BM25 keyword search alongside vector search
# Use RRF (Reciprocal Rank Fusion) to combine results
# Often improves recall for exact-match queries (IDs, product names, error codes)
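
Reciprocal Rank Fusion itself is short enough to sketch in full. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The version below assumes the two searches have already run and returned ranked lists of chunk IDs. The IDs are made up, and k = 60 is the conventional default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["chunk_3", "chunk_1", "chunk_7"]   # from the vector search
keyword_hits = ["chunk_1", "chunk_9", "chunk_3"]  # from a BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["chunk_1", "chunk_3", "chunk_9", "chunk_7"]
```

Chunks that appear in both lists rise to the top, and because only ranks are used, the two searches' raw scores never need to be put on a common scale.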

Query rewriting:

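# Before searching, rewrite the user query for better retrieval
def rewrite_query(query: str) -> str:
    rewrite_prompt = f"""Rewrite this query for document retrieval:
"{query}"
Make it more specific and include key terms that documents would contain."""
    response = claude.messages.create(model="claude-haiku-4-5-20251001", max_tokens=200,
                                      messages=[{"role": "user", "content": rewrite_prompt}])
    return response.content[0].text.strip()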

Answer evaluation:

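# Check if the answer is actually grounded in the retrieved context
# Helps catch hallucinated answers before they reach the user
def check_grounding(context: str, answer: str) -> str:
    eval_prompt = f"""Does this answer only use information from the context?
Context: {context}
Answer: {answer}
Rate: grounded / partially_grounded / not_grounded"""
    response = claude.messages.create(model="claude-haiku-4-5-20251001", max_tokens=20,
                                      messages=[{"role": "user", "content": eval_prompt}])
    return response.content[0].text.strip()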
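
Model replies to a rating prompt like this rarely come back as a bare label, so some normalization helps before acting on the verdict. The parser below is a simple heuristic sketch. It assumes the reply contains one of the three labels from the prompt, possibly with different casing, spaces, or hyphens. `parse_grounding_label` is a hypothetical helper.

```python
# Longer labels come first because "grounded" is a substring of the other two.
GROUNDING_LABELS = ("partially_grounded", "not_grounded", "grounded")

def parse_grounding_label(reply: str) -> str:
    """Map a model reply onto one of the three labels, or 'unknown'."""
    normalized = reply.strip().lower().replace(" ", "_").replace("-", "_")
    for label in GROUNDING_LABELS:
        if label in normalized:
            return label
    return "unknown"

verdicts = [
    parse_grounding_label("Rate: grounded"),
    parse_grounding_label("Not grounded."),
    parse_grounding_label("Partially-grounded"),
    parse_grounding_label("I cannot tell."),
]
```

Free-form phrasing such as "not fully grounded" would slip past this matcher, so asking the model to reply with the label only keeps the parsing reliable.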

Deployment Options

  • Prototype: ChromaDB in-memory + Claude API
  • Small production: ChromaDB server or Qdrant on a single VM
  • Scale: Pinecone managed + Claude API with caching
  • Privacy-sensitive: Qdrant self-hosted + Ollama local models
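
One possible reading of "with caching" is an application-level cache that returns a stored answer when the same question is asked again. The sketch below is an assumption about what that could look like, not a prescribed design. It wraps any answer function (such as `answer_with_rag`) and keys the cache on a hash of the normalized question. `AnswerCache` and `fake_answer` are hypothetical names.

```python
import hashlib
from typing import Callable

class AnswerCache:
    """In-memory cache keyed by a hash of the normalized question."""

    def __init__(self, answer_fn: Callable[[str], dict]):
        self._answer_fn = answer_fn
        self._store: dict[str, dict] = {}
        self.hits = 0

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share one entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, question: str) -> dict:
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._answer_fn(question)  # cache miss: do the real work
        self._store[key] = result
        return result

# Stand-in for answer_with_rag so the sketch runs without API keys
calls = []
def fake_answer(question: str) -> dict:
    calls.append(question)
    return {"answer": f"stub answer for: {question}"}

cache = AnswerCache(fake_answer)
first = cache.ask("What is the refund policy?")
second = cache.ask("  what is the REFUND policy? ")
```

A production cache would also need expiry and invalidation when the underlying documents change, which this sketch leaves out.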