RAG System Architecture: Beyond the Basics
Moving past tutorial-level RAG. Chunking strategies, hybrid search, and evaluation frameworks that actually work.
Every RAG tutorial shows the same thing: chunk your documents, embed them, store in a vector database, retrieve top-k, and prompt an LLM. That gets you 60% of the way there. This article is about the other 40%.
The Tutorial Gap
Tutorial RAG systems fail in production for predictable reasons:
- Fixed-size chunking destroys document structure
- Pure vector search misses keyword matches
- No way to know when retrieval fails
- Context windows overflow with irrelevant chunks
Let's fix each of these.
Chunking That Preserves Meaning
Semantic Chunking
Instead of splitting every 500 tokens, split on semantic boundaries. The algorithm:
- Split into sentences
- Embed each sentence
- Calculate cosine similarity between adjacent sentences
- Split where similarity drops below threshold
def semantic_chunk(text, threshold=0.5):
    # split_sentences, embed_batch, and cosine_sim are helper functions defined elsewhere
    sentences = split_sentences(text)
    embeddings = embed_batch(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_sim(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(sentences[i])
    chunks.append(' '.join(current_chunk))  # don't drop the final chunk
    return chunks
Document-Aware Chunking
Different document types need different strategies:
- Markdown/HTML: Split on headers, preserve hierarchy (see the sketch after this list)
- PDFs: Respect page boundaries, extract tables separately
- Code: Split on function/class boundaries
- Conversations: Keep speaker turns together
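For example, a minimal header-based Markdown splitter (a sketch, not a full parser) can track the current header path and return (breadcrumb, chunk) pairs so hierarchy survives the split:

import re

def chunk_markdown(text, max_level=3):
    # Split on headers up to max_level; keep the header path with each chunk.
    header_re = re.compile(r'^(#{1,%d})\s+(.+)' % max_level)
    chunks, path, current = [], {}, []

    def flush():
        if current:
            breadcrumb = ' > '.join(path[lvl] for lvl in sorted(path))
            chunks.append((breadcrumb, '\n'.join(current)))

    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            current = []
            level = len(match.group(1))
            # Drop deeper headers from the path when a same-or-higher header starts.
            path = {lvl: h for lvl, h in path.items() if lvl < level}
            path[level] = match.group(2)
        current.append(line)
    flush()
    return chunks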
Hybrid Search
Vector search is great for semantic similarity but terrible for exact matches. "What is our refund policy for order #12345?" will fail if you only use embeddings.
The Solution: Combine BM25 + Vector
def hybrid_search(query, k=10, alpha=0.5):
    # BM25 keyword search
    bm25_results = bm25_index.search(query, k=k * 2)
    # Vector semantic search
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, k=k * 2)
    # Reciprocal Rank Fusion of both result lists (alpha weights the BM25 side)
    combined = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        weights=[alpha, 1 - alpha],
    )
    return combined[:k]
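The reciprocal_rank_fusion helper isn't shown above; a minimal weighted version, assuming each result object exposes an id attribute, could look like this:

def reciprocal_rank_fusion(result_lists, weights=None, k=60):
    # Weighted RRF: fused_score(d) = sum_i weight_i / (k + rank_i(d)).
    # k=60 is the commonly used RRF constant, not the number of results.
    if weights is None:
        weights = [1.0] * len(result_lists)
    scores, docs = {}, {}
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (k + rank)
            docs[doc.id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never have to be normalized onto a common scale.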
In practice, I use α=0.3 (favoring semantic) for general queries and α=0.7 (favoring keyword) for queries with specific identifiers.
Reranking
Initial retrieval is fast but imprecise. Reranking is slow but accurate. Use both:
- Retrieve top 50 candidates with hybrid search
- Rerank with a cross-encoder model
- Take top 5 for the LLM context
Cross-encoders like ms-marco-MiniLM-L-6-v2 are small enough to run on CPU and dramatically improve relevance.
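A sketch using the sentence-transformers CrossEncoder wrapper, assuming each candidate from hybrid search exposes a text attribute:

from sentence_transformers import CrossEncoder

# Small cross-encoder; workable on CPU for a few dozen candidates.
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_n=5):
    # Score every (query, passage) pair jointly, then keep the best top_n.
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:top_n]]

With 50 candidates in and top_n=5 out, this matches the retrieve-then-rerank flow described above.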
Knowing When Retrieval Fails
The worst RAG failure mode is confident wrong answers. You need to detect when retrieval didn't find relevant content.
Relevance Scoring
After reranking, check the top score. If it's below a threshold, your retrieval probably failed:
def should_retrieve(query, results, threshold=0.5):
    # Call after reranking; tune the threshold to your reranker's score scale.
    if not results:
        return False, "No results found"
    top_score = results[0].score
    if top_score < threshold:
        return False, f"Low confidence ({top_score:.2f})"
    return True, results
When retrieval fails, have the LLM say "I don't have information about that" instead of hallucinating.
Context Window Management
Don't just concatenate all retrieved chunks. Be strategic (a minimal context builder is sketched after this list):
- Deduplicate: Similar chunks waste tokens
- Summarize: Long chunks can be compressed
- Prioritize: Put most relevant content first (LLMs have primacy bias)
- Cite: Add source references for verification
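Here is a minimal context builder covering the deduplicate, prioritize, and cite steps (summarization is left out). It assumes chunks arrive sorted by relevance and carry text, embedding, and source attributes, and it reuses the cosine_sim helper from earlier plus a count_tokens function:

def build_context(chunks, max_tokens=4000, dedup_threshold=0.9):
    # chunks are assumed to be sorted by relevance, best first.
    selected, used = [], 0
    for chunk in chunks:
        # Deduplicate: skip anything too similar to a chunk we already kept.
        if any(cosine_sim(chunk.embedding, kept.embedding) > dedup_threshold
               for kept in selected):
            continue
        cost = count_tokens(chunk.text)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    # Prioritize: most relevant first. Cite: prefix each chunk with its source.
    return '\n\n'.join(f'[{chunk.source}] {chunk.text}' for chunk in selected)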
Evaluation Framework
You can't improve what you can't measure. Build evaluation into your pipeline from day one.
Metrics That Matter
- Retrieval Precision@k: Are retrieved docs relevant? (see the sketch after this list)
- Answer Correctness: Is the final answer right?
- Faithfulness: Is the answer grounded in retrieved context?
- Latency: p50, p95, p99 response times
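Precision@k is the easiest of these to automate once each test question has labeled relevant documents; a minimal version:

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved documents that are actually relevant.
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / k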
Building a Test Set
Create a golden dataset of 100+ question-answer pairs (one possible storage format is sketched after this list). Include:
- Easy questions (answer is explicit in one document)
- Multi-hop questions (need to combine information)
- Unanswerable questions (test refusal behavior)
- Questions with specific identifiers (test keyword search)
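One lightweight way to store the set is a JSONL file with one case per line; the field names below are an illustrative schema, not a standard:

import json

# Illustrative schema for a single golden example; adapt field names to your pipeline.
EXAMPLE_CASE = {
    "question": "...",                 # the user query
    "reference_answer": "...",         # expected answer, or null for unanswerable cases
    "relevant_doc_ids": [],            # documents the retriever should surface
    "category": "easy",                # easy | multi_hop | unanswerable | identifier
}

def load_golden_set(path):
    # One JSON object per line -> list of evaluation cases.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]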
Production Architecture
My current production stack:
- Embedding: OpenAI text-embedding-3-small (good price/performance)
- Vector DB: Qdrant (fast, good filtering)
- BM25: Elasticsearch or built-in Qdrant sparse vectors
- Reranker: Cohere rerank-english-v2.0
- LLM: Claude 3.5 Sonnet (best instruction following)
Conclusion
Production RAG is 20% retrieval and 80% engineering around the edges: chunking, hybrid search, reranking, evaluation, and graceful failure handling.
The systems that work aren't the most sophisticated—they're the ones that fail predictably and improve systematically.