Effective Implementation Strategies for RAG: A Production-Grade Architecture

Effective RAG implementation requires a multi-stage pipeline combining recursive document chunking, hybrid retrieval with Reciprocal Rank Fusion, query rewriting techniques like HyDE, and cross-encoder reranking to maximize retrieval accuracy before generation.

Retrieval-Augmented Generation (RAG) enhances large language models by grounding responses in external knowledge, but production deployments demand rigorous engineering across every stage. The repository rohitg00/ai-engineering-from-scratch provides a complete reference implementation demonstrating how to build robust, measurable RAG systems from first principles.

Document Ingestion and Intelligent Chunking

The foundation of any RAG system lies in how raw documents are segmented. Poor chunk boundaries destroy semantic coherence and degrade both BM25 and dense retrieval performance.

The reference implementation in phases/19-capstone-projects/64-chunking-strategies-advanced/code/main.py provides a Chunker class that uses recursive splitting with sentence-aware boundaries. This approach combines fixed-size windows (typically 256–512 tokens) with overlap to preserve context across chunk boundaries. According to the repository's LongRAG-style experiments, tuning chunk size within this range recovers approximately 35% of recall compared to naive character-based splitting.

Key implementation details include maintaining chunk metadata (source document ID, section headers) to enable precise citation tracking during the generation phase.
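
The sentence-aware windowing and metadata tracking described above can be sketched as follows. This is a simplified illustration, not the repository's Chunker class: it packs whole sentences into windows rather than splitting recursively, and it counts whitespace-separated words as a stand-in for tokens. The names Chunk and chunk_document are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # source document ID, kept so answers can cite it later
    index: int   # position of the chunk within its document
    text: str


def chunk_document(doc_id, text, max_tokens=256, overlap_sentences=1):
    """Pack whole sentences into windows of roughly max_tokens words.

    Assumption: whitespace-separated words approximate tokens. A window can
    exceed max_tokens when the carried-over overlap or one sentence is long.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, window, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if window and count + n > max_tokens:
            chunks.append(Chunk(doc_id, len(chunks), " ".join(window)))
            # Carry the last sentence(s) forward so context spans the boundary.
            window = window[-overlap_sentences:] if overlap_sentences else []
            count = sum(len(s.split()) for s in window)
        window.append(sentence)
        count += n
    if window:
        chunks.append(Chunk(doc_id, len(chunks), " ".join(window)))
    return chunks
```

Each Chunk keeps its doc_id and index, which is the minimum needed to emit a citation once the chunk reaches the generator.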

Hybrid Retrieval Architecture

Production RAG systems cannot rely solely on dense vector similarity. The repository implements a hybrid retrieval strategy that combines lexical and semantic search to handle diverse query distributions.

Dense Vector Indexing

The DenseIndex class in phases/19-capstone-projects/69-end-to-end-rag-system/code/main.py (line 186) encodes chunks using dense embeddings stored in a fast Approximate Nearest Neighbor (ANN) index. This excels at capturing semantic similarity and paraphrased concepts.
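
To make the idea concrete, here is a minimal sketch of dense retrieval by cosine similarity. It is not the repository's DenseIndex: it performs exact brute-force search rather than ANN, and it assumes the vectors come from some embedding model that is not shown. ToyDenseIndex and cosine are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyDenseIndex:
    """Exact (brute-force) nearest-neighbour search over stored vectors."""

    def __init__(self):
        self.ids, self.vectors = [], []

    def add(self, chunk_id, vector):
        self.ids.append(chunk_id)
        self.vectors.append(vector)

    def search(self, query_vector, top_k=3):
        scored = [
            (cosine(query_vector, v), cid)
            for cid, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(cid, score) for score, cid in scored[:top_k]]
```

Brute-force search scans every vector per query, which is why larger corpora typically swap this loop for an ANN structure.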

BM25 Lexical Retrieval

Complementing dense retrieval, the BM25Index class (line 23) provides classic term-frequency based indexing. BM25 shines on exact-match queries containing specific technical terms, product names, or error codes where dense vectors might fail.
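
A compact sketch of BM25 scoring shows why exact terms such as error codes rank so sharply. This is an illustration under simple assumptions (lowercased whitespace tokenization, the common defaults k1=1.5 and b=0.75), not the repository's BM25Index; ToyBM25 is a hypothetical name.

```python
import math
from collections import Counter


class ToyBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        # docs: dict mapping doc_id -> text
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tfs = [Counter(docs[d].lower().split()) for d in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        n = len(self.ids)
        return math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))

    def search(self, query, top_k=3):
        results = []
        for doc_id, tf, length in zip(self.ids, self.tfs, self.lengths):
            score = 0.0
            for term in query.lower().split():
                f = tf[term]
                if f == 0:
                    continue
                # Term frequency saturates via k1; b normalizes by length.
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self.idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                results.append((doc_id, score))
        return sorted(results, key=lambda r: r[1], reverse=True)[:top_k]
```

A rare token such as an error code has a high inverse document frequency, so a single exact match can outrank documents that only share common words with the query.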

Reciprocal Rank Fusion (RRF)

Rather than interpolating scores from different retrieval methods, the HybridIndex.rrf method (line 200) implements Reciprocal Rank Fusion. This rank-based voting algorithm merges results by summing reciprocal ranks:


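# Conceptual RRF score for a single document
k = 60  # smoothing constant; 60 is a common default
ranks_from_different_retrievers = [1, 3]  # e.g. rank 1 in BM25, rank 3 in dense
score = 0.0
for rank in ranks_from_different_retrievers:
    score += 1.0 / (k + rank)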

RRF is deterministic, requires no score normalization, and in practice often outperforms naive score averaging across heterogeneous retrieval methods, because it depends only on rank positions rather than on incomparable score scales.
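
Extending the per-document formula to whole result lists, a fusion step might look like the sketch below. It assumes each retriever returns an ordered list of document IDs; reciprocal_rank_fusion is a hypothetical name and is not claimed to match the signature of HybridIndex.rrf.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists (here "b" and "c") accumulate two terms and rise above documents that only one retriever found.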

Query Rewriting Strategies

Raw user queries often misalign with the indexed corpus vocabulary. The Rewriter class in phases/19-capstone-projects/69-end-to-end-rag-system/code/main.py (lesson 67) implements three production-grade transformation techniques:

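HyDE (Hypothetical Document Embeddings): The rewrite_hyde method generates a hypothetical answer to the query, embeds that answer, and retrieves against the hypothesis rather than the original question. The hypothetical answer does not need to be factually correct; it only needs to resemble a real answer passage in vocabulary and structure, so that its embedding lands closer to relevant documents than the embedding of the bare question would.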

Multi-Query Expansion: The rewrite_multiquery method generates N paraphrases of the original query, retrieves chunks for each variant, and aggregates results. This adds diversity and captures different lexical variations of the same intent.

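Query Decomposition: The rewrite_decompose method splits multi-faceted questions into sub-questions, each of which is retrieved independently. This addresses the failure mode in which a complex query matches no single chunk well because the evidence it needs is spread across several chunks.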
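
The control flow of HyDE and multi-query expansion can be sketched with the model calls injected as plain callables. Everything here is hypothetical scaffolding: generate, embed, search, paraphrase, and retrieve stand in for whatever LLM, embedder, and index are in use, and the function names do not mirror the Rewriter class.

```python
def hyde_retrieve(query, generate, embed, search, top_k=5):
    """Retrieve against the embedding of a hypothetical answer."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), top_k)


def multiquery_retrieve(query, paraphrase, retrieve, n=3, top_k=5):
    """Retrieve for the query plus n paraphrases, then merge by best rank."""
    variants = [query] + paraphrase(query, n)
    best_rank = {}
    for variant in variants:
        for rank, chunk_id in enumerate(retrieve(variant, top_k), start=1):
            best_rank[chunk_id] = min(rank, best_rank.get(chunk_id, rank))
    # Lower rank is better; ties are broken by chunk ID for determinism.
    return sorted(best_rank, key=lambda c: (best_rank[c], c))[:top_k]
```

Merging by best rank is only one option; the per-variant lists could equally be combined with Reciprocal Rank Fusion.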

Cross-Encoder Reranking

Initial retrieval returns a broad candidate set (top-100), but the generator's context window is limited. The CrossEncoder implementation in phases/19-capstone-projects/66-reranker-cross-encoder/code/main.py performs a second-pass relevance scoring.

Unlike bi-encoders that encode query and document separately, the cross-encoder takes the concatenated query-candidate pair as input, enabling interaction-aware relevance scoring. This reranking step runs only on the top-k candidates (typically 20–100), making it computationally efficient while significantly improving precision before generation.
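
The second-pass shape can be sketched with the scorer injected as a function. The rerank helper below is hypothetical, and overlap_score is a deliberately crude stand-in for a real cross-encoder model: it only illustrates that the scorer sees the query and the candidate text together.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-order (chunk_id, text) candidates by a joint relevance score."""
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words in text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


candidates = [
    ("c1", "billing and invoices"),
    ("c2", "how to reset an api key"),
    ("c3", "api rate limits"),
]
top = rerank("reset api key", candidates, overlap_score, top_n=2)
```

Because score_pair is called once per candidate, cost grows linearly with the size of the candidate set, which is the reason for capping it before this step.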

Generation with Source Citations

The generate_answer function (line 435) feeds the reranked chunks as grounding context to the LLM. The prompt includes explicit instructions telling the model to cite sources using the chunk metadata preserved during ingestion:

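def generate_answer(query, contexts, llm):
    # Illustrative sketch. Assumptions: contexts is a list of
    # (source_id, text) pairs, and llm is any client exposing generate(prompt).
    context_block = "\n".join(f"[{source_id}] {text}" for source_id, text in contexts)
    prompt = (
        "Answer the question using only the provided context.\n"
        "Cite the source document ID for each claim.\n\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\n"
    )
    return llm.generate(prompt)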

Evaluation Metrics and Feedback Loops

RAG systems require rigorous evaluation across both retrieval and generation quality. The run_eval function in phases/19-capstone-projects/68-rag-eval-precision-recall/code/main.py (line 670) implements a comprehensive metrics suite:

  • Recall@k: Primary retrieval metric measuring whether relevant documents appear in the top-k results
  • Precision@k: Measures the proportion of retrieved chunks that are relevant
  • MRR (Mean Reciprocal Rank): Evaluates the position of the first relevant document
  • nDCG (Normalized Discounted Cumulative Gain): Accounts for graded relevance and ranking position
  • Faithfulness: Measures whether generated claims are supported by retrieved contexts
  • Answer Relevance: Scores how well the final answer addresses the original query

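Using Recall@k as the headline metric surfaces retrieval-stage failures, since the generator cannot use evidence that was never retrieved. Precision@k indicates how much irrelevant context reaches the generator, while faithfulness and answer relevance pinpoint problems in the generation stage itself.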
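
The retrieval metrics listed above have short standard definitions, sketched here for a single query. These helper names are hypothetical and are not taken from run_eval; MRR is the mean of reciprocal_rank over all evaluation queries, and nDCG takes a dictionary of graded relevance gains.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant IDs that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Share of the top k slots filled by relevant IDs."""
    return len(set(retrieved[:k]) & relevant) / k if k else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gains, k):
    """DCG of the top k divided by the DCG of an ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(i + 1)
        for i, doc_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

Faithfulness and answer relevance are not shown because they usually rely on a judge model or human labels rather than a closed-form formula.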

End-to-End Pipeline Orchestration

The Pipeline class (line 472) in phases/19-capstone-projects/69-end-to-end-rag-system/code/main.py orchestrates all components into a unified interface:

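import sys
from pathlib import Path

# The lesson directory names contain hyphens, so the file cannot be imported
# through a dotted package path. Assuming the script runs from the repository
# root, add the lesson's code directory to sys.path and import main.py directly.
sys.path.insert(0, "phases/19-capstone-projects/69-end-to-end-rag-system/code")
from main import build_pipeline

# Initialize pipeline with all components
pipeline = build_pipeline()

# Ingest corpus with document IDs
corpus = [
    ("doc1", Path("data/faq.txt").read_text()),
    ("doc2", Path("data/api.md").read_text()),
]
pipeline.ingest(corpus)

# Execute query
question = "How can I reset my API key?"
result = pipeline.query(question)

print("Answer:", result.answer)
print("Citations:", result.citations)  # e.g. ['doc1#section-2']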

The build_pipeline() function wires the chunker → hybrid index (BM25 + Dense) → rewriter → cross-encoder reranker → generator. The query method returns a structured Result object containing the final answer, retrieved chunks, and source citations.

Summary

Effective RAG implementation strategies from the ai-engineering-from-scratch repository include:

  • Recursive chunking with 256–512 token windows and sentence-aware boundaries to preserve semantic coherence
  • Hybrid retrieval combining BM25 for exact matches and dense vectors for semantic similarity, merged via Reciprocal Rank Fusion
  • Query rewriting using HyDE, multi-query expansion, and decomposition to align user intent with corpus vocabulary
  • Cross-encoder reranking as a second pass restricted to the top-k candidates, refining them before generation
  • Metric-driven iteration using Recall@k, precision, MRR, and faithfulness to identify and fix pipeline weaknesses

Frequently Asked Questions

What is the optimal chunk size for RAG systems?

The repository demonstrates that chunk sizes between 256 and 512 tokens provide the best recall balance. Sizes below 256 tokens often split semantic units, while sizes above 512 tokens dilute relevance signals and may exceed embedding model limits. The Chunker implementation in lesson 64 uses recursive splitting with overlap to maintain context across boundaries.

How does Reciprocal Rank Fusion differ from score interpolation?

Reciprocal Rank Fusion (RRF) combines rankings rather than raw scores. While interpolation requires normalizing incompatible score scales from different retrieval methods (BM25 scores vs. cosine similarities), RRF simply sums the reciprocal ranks (1/(k + rank)) from each retriever. This approach is normalization-free, deterministic, and robust across query types, as implemented in the HybridIndex.rrf method.

When should I use HyDE versus multi-query expansion?

Use HyDE (Hypothetical Document Embeddings) when queries are underspecified or require domain knowledge to match document vocabulary, as it generates a hypothetical answer that bridges the lexical gap. Use multi-query expansion when the query is clear but might have multiple valid phrasings or when you need to capture different aspects of a complex question. The repository implements both in the Rewriter class (lesson 67) and suggests they can be combined for maximum recall.

Why is cross-encoder reranking necessary if I already have dense retrieval?

Dense retrieval uses bi-encoders that encode queries and documents separately, which is fast but sacrifices interaction-aware precision. The cross-encoder (lesson 66) concatenates query and candidate text, enabling the model to attend to interactions between them. While too slow for the initial retrieval over millions of documents, it provides significant relevance gains when applied to the top-100 candidates, filtering out false positives before the expensive generation step.
