RAG System Architecture: Beyond the Basics
Moving past tutorial-level RAG. Chunking strategies, hybrid search, and evaluation frameworks that actually work.
Every RAG tutorial shows the same thing: chunk your documents, embed them, store in a vector database, retrieve top-k, and prompt an LLM. That gets you 60% of the way there. This article is about the other 40%.
The Tutorial Gap
Tutorial RAG systems fail in production for predictable reasons:
- Fixed-size chunking destroys document structure
- Pure vector search misses keyword matches
- No way to know when retrieval fails
- Context windows overflow with irrelevant chunks
Let's fix each of these.
Chunking That Preserves Meaning
Semantic Chunking
Instead of splitting every 500 tokens, split on semantic boundaries. The algorithm:
- Split into sentences
- Embed each sentence
- Calculate cosine similarity between adjacent sentences
- Split where similarity drops below threshold
def semantic_chunk(text, threshold=0.5):
    # split_sentences, embed_batch, and cosine_sim are helper functions defined elsewhere
    sentences = split_sentences(text)
    embeddings = embed_batch(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_sim(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(sentences[i])
    chunks.append(' '.join(current_chunk))  # don't drop the final chunk
    return chunks
Document-Aware Chunking
Different document types need different strategies:
- Markdown/HTML: Split on headers, preserve hierarchy (see the sketch after this list)
- PDFs: Respect page boundaries, extract tables separately
- Code: Split on function/class boundaries
- Conversations: Keep speaker turns together
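For example, a minimal header-based Markdown splitter (a sketch, not a full parser) can track the current header path and return (breadcrumb, chunk) pairs so hierarchy survives the split:

import re

def chunk_markdown(text, max_level=3):
    # Split on headers up to max_level; keep the header path with each chunk.
    header_re = re.compile(r'^(#{1,%d})\s+(.+)' % max_level)
    chunks, path, current = [], {}, []

    def flush():
        if current:
            breadcrumb = ' > '.join(path[lvl] for lvl in sorted(path))
            chunks.append((breadcrumb, '\n'.join(current)))

    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            current = []
            level = len(match.group(1))
            # Drop deeper headers from the path when a same-or-higher header starts.
            path = {lvl: h for lvl, h in path.items() if lvl < level}
            path[level] = match.group(2)
        current.append(line)
    flush()
    return chunks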
Hybrid Search
Vector search is great for semantic similarity but terrible for exact matches. "What is our refund policy for order #12345?" will fail if you only use embeddings.
The Solution: Combine BM25 + Vector
def hybrid_search(query, k=10, alpha=0.5):
    # BM25 keyword search
    bm25_results = bm25_index.search(query, k=k * 2)
    # Vector semantic search
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, k=k * 2)
    # Reciprocal Rank Fusion of both result lists (alpha weights the BM25 side)
    combined = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        weights=[alpha, 1 - alpha],
    )
    return combined[:k]
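The reciprocal_rank_fusion helper isn't shown above; a minimal weighted version, assuming each result object exposes an id attribute, could look like this:

def reciprocal_rank_fusion(result_lists, weights=None, k=60):
    # Weighted RRF: fused_score(d) = sum_i weight_i / (k + rank_i(d)).
    # k=60 is the commonly used RRF constant, not the number of results.
    if weights is None:
        weights = [1.0] * len(result_lists)
    scores, docs = {}, {}
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (k + rank)
            docs[doc.id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never have to be normalized onto a common scale.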
In practice, I use α=0.3 (favoring semantic) for general queries and α=0.7 (favoring keyword) for queries with specific identifiers.
Reranking
Initial retrieval is fast but imprecise. Reranking is slow but accurate. Use both:
- Retrieve top 50 candidates with hybrid search
- Rerank with a cross-encoder model
- Take top 5 for the LLM context
Cross-encoders like ms-marco-MiniLM-L-6-v2 are small enough to run on CPU and dramatically improve relevance.
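A sketch using the sentence-transformers CrossEncoder wrapper, assuming each candidate from hybrid search exposes a text attribute:

from sentence_transformers import CrossEncoder

# Small cross-encoder; workable on CPU for a few dozen candidates.
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_n=5):
    # Score every (query, passage) pair jointly, then keep the best top_n.
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:top_n]]

With 50 candidates in and top_n=5 out, this matches the retrieve-then-rerank flow described above.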
Knowing When Retrieval Fails
The worst RAG failure mode is confident wrong answers. You need to detect when retrieval didn't find relevant content.
Relevance Scoring
After reranking, check the top score. If it's below a threshold, your retrieval probably failed:
def should_retrieve(query, results, threshold=0.5):
    # Call after reranking; tune the threshold to your reranker's score scale.
    if not results:
        return False, "No results found"
    top_score = results[0].score
    if top_score < threshold:
        return False, f"Low confidence ({top_score:.2f})"
    return True, results
When retrieval fails, have the LLM say "I don't have information about that" instead of hallucinating.
Context Window Management
Don't just concatenate all retrieved chunks. Be strategic (a minimal context builder is sketched after this list):
- Deduplicate: Similar chunks waste tokens
- Summarize: Long chunks can be compressed
- Prioritize: Put most relevant content first (LLMs have primacy bias)
- Cite: Add source references for verification
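Here is a minimal context builder covering the deduplicate, prioritize, and cite steps (summarization is left out). It assumes chunks arrive sorted by relevance and carry text, embedding, and source attributes, and it reuses the cosine_sim helper from earlier plus a count_tokens function:

def build_context(chunks, max_tokens=4000, dedup_threshold=0.9):
    # chunks are assumed to be sorted by relevance, best first.
    selected, used = [], 0
    for chunk in chunks:
        # Deduplicate: skip anything too similar to a chunk we already kept.
        if any(cosine_sim(chunk.embedding, kept.embedding) > dedup_threshold
               for kept in selected):
            continue
        cost = count_tokens(chunk.text)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    # Prioritize: most relevant first. Cite: prefix each chunk with its source.
    return '\n\n'.join(f'[{chunk.source}] {chunk.text}' for chunk in selected)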
Evaluation Framework
You can't improve what you can't measure. Build evaluation into your pipeline from day one.
Metrics That Matter
- Retrieval Precision@k: Are retrieved docs relevant? (see the sketch after this list)
- Answer Correctness: Is the final answer right?
- Faithfulness: Is the answer grounded in retrieved context?
- Latency: p50, p95, p99 response times
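Precision@k is the easiest of these to automate once each test question has labeled relevant documents; a minimal version:

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved documents that are actually relevant.
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / k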
Building a Test Set
Create a golden dataset of 100+ question-answer pairs (one possible storage format is sketched after this list). Include:
- Easy questions (answer is explicit in one document)
- Multi-hop questions (need to combine information)
- Unanswerable questions (test refusal behavior)
- Questions with specific identifiers (test keyword search)
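One lightweight way to store the set is a JSONL file with one case per line; the field names below are an illustrative schema, not a standard:

import json

# Illustrative schema for a single golden example; adapt field names to your pipeline.
EXAMPLE_CASE = {
    "question": "...",                 # the user query
    "reference_answer": "...",         # expected answer, or null for unanswerable cases
    "relevant_doc_ids": [],            # documents the retriever should surface
    "category": "easy",                # easy | multi_hop | unanswerable | identifier
}

def load_golden_set(path):
    # One JSON object per line -> list of evaluation cases.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]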
Production Architecture
My current production stack:
- Embedding: OpenAI text-embedding-3-small (good price/performance)
- Vector DB: Qdrant (fast, good filtering)
- BM25: Elasticsearch or built-in Qdrant sparse vectors
- Reranker: Cohere rerank-english-v2.0
- LLM: Claude 3.5 Sonnet (best instruction following)
Conclusion
Production RAG is 20% retrieval and 80% engineering around the edges: chunking, hybrid search, reranking, evaluation, and graceful failure handling.
The systems that work aren't the most sophisticated—they're the ones that fail predictably and improve systematically.