How RAG Is Implemented with Chunking and Reranking in AI-Engineering-From-Scratch

The RAG pipeline in rohitg00/ai-engineering-from-scratch implements a three-stage retrieval architecture: whitespace-token chunking with overlap, TF-IDF vector indexing, and cosine-similarity search followed by a lightweight word-overlap step that selects the answer sentence from the retrieved chunks.

This educational implementation in the AI-Engineering-From-Scratch curriculum demonstrates how Retrieval-Augmented Generation works under the hood without external neural libraries. The entire pipeline resides in phases/11-llm-engineering/06-rag/code/main.py and uses only Python standard library components to illustrate the core mechanics of chunking, embedding, and reranking.

The Three-Stage RAG Architecture

The repository follows a classic retrieval flow designed for clarity and algorithmic transparency:

  1. Chunking – Documents are segmented into overlapping windows to preserve boundary context.
  2. Embedding & Indexing – Chunks are converted to TF-IDF vectors and stored in self.embeddings.
  3. Retrieval & Reranking – Queries are matched via cosine similarity, then a word-overlap selector reranks sentences to produce the final answer.

Stage 1: Document Chunking with Overlap

Effective RAG requires breaking large documents into searchable units while ensuring context that spans boundaries remains intact. The implementation achieves this through whitespace tokenization and sliding windows.

The chunk_text Function Implementation

Located at lines 5-14 in phases/11-llm-engineering/06-rag/code/main.py, the chunk_text function tokenizes input strings on whitespace and generates fixed-size chunks with configurable overlap:

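# Assumes this runs from phases/11-llm-engineering/06-rag/code/, next to main.py
from main import chunk_text

text = "Your long document text here spanning multiple sentences..."
# 200-token windows; each chunk repeats the last 50 tokens of the previous one
chunks = chunk_text(text, chunk_size=200, overlap=50)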

The function accepts two critical parameters:

  • chunk_size – The number of tokens per chunk.
  • overlap – The number of tokens shared between consecutive chunks to prevent context loss at boundaries.

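Because each window starts chunk_size - overlap tokens after the previous one, a phrase no longer than the overlap that straddles a boundary still appears whole in at least one chunk, which improves recall during retrieval. Longer passages can still be split, so the overlap should be sized to the typical sentence length.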
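The sliding window described above can be sketched as follows. This is a minimal illustration, not the repository's code: the name chunk_text_sketch and the edge-case handling (input validation, stopping once the window reaches the end) are assumptions made for the example.

```python
def chunk_text_sketch(text, chunk_size=200, overlap=50):
    """Split text into whitespace-token windows that share `overlap` tokens.

    Hypothetical helper for illustration; the repository's chunk_text may
    handle edge cases differently.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text_sketch(words, chunk_size=4, overlap=2):
    print(chunk)
# w0 w1 w2 w3
# w2 w3 w4 w5
# w4 w5 w6 w7
# w6 w7 w8 w9
```

With chunk_size=4 and overlap=2 the window advances two tokens at a time, so every adjacent pair of chunks shares exactly two tokens.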

Stage 2: TF-IDF Embedding and Indexing

Rather than relying on pre-trained neural embeddings, this from-scratch implementation uses classical TF-IDF (Term Frequency-Inverse Document Frequency) to create searchable vector representations.

Building the Vocabulary and Computing IDF

The pipeline constructs a global vocabulary via build_vocabulary, which gathers unique lower-cased tokens from all chunks across the document corpus. It then computes inverse-document-frequency weights using compute_idf, establishing the statistical importance of each term.
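As a rough sketch of these two steps, the functions below build a token-to-index vocabulary and a matching list of IDF weights. The function names, the return types, and the smoothed IDF formula are assumptions chosen for illustration; the repository's build_vocabulary and compute_idf may differ in all three.

```python
import math


def build_vocabulary_sketch(chunks):
    """Map each unique lower-cased token to a column index (hypothetical helper)."""
    tokens = sorted({tok for chunk in chunks for tok in chunk.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def compute_idf_sketch(chunks, vocab):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1.

    This is one common variant and an assumption here, not necessarily
    the formula used in the repository.
    """
    n = len(chunks)
    df = [0] * len(vocab)
    for chunk in chunks:
        for tok in set(chunk.lower().split()):  # count each chunk once per term
            df[vocab[tok]] += 1
    return [math.log((1 + n) / (1 + d)) + 1.0 for d in df]


chunks = ["refunds require a ticket", "enterprise refunds are full"]
vocab = build_vocabulary_sketch(chunks)
idf = compute_idf_sketch(chunks, vocab)
# "refunds" occurs in every chunk, so it gets the lowest weight
print(idf[vocab["refunds"]], idf[vocab["enterprise"]])
```

Terms that appear in every chunk carry little discriminating power, which is why their weight is the smallest in the list.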

Generating Chunk Embeddings

Each chunk is transformed into a vocabulary-length vector using tfidf_embed, which multiplies term-frequency counts by the pre-computed IDF vector. Most entries are zero, because a chunk contains only a small part of the vocabulary. The resulting embeddings are stored in the self.embeddings attribute of the RAGPipeline class, creating an in-memory index ready for similarity search.

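from main import RAGPipeline  # assumes the working directory contains main.py

# doc1, doc2, and doc3 are document strings loaded elsewhere
pipeline = RAGPipeline(chunk_size=200, overlap=50, top_k=5)
pipeline.index(
    documents=[doc1, doc2, doc3],
    source_names=["doc1.md", "doc2.md", "doc3.md"]
)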
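To make the embedding step concrete, here is a minimal sketch of a count-times-IDF embedder. The name tfidf_embed_sketch, the dict-based vocabulary, and the toy weights are assumptions for the example; whether the repository also normalizes term counts by chunk length is not confirmed here.

```python
from collections import Counter


def tfidf_embed_sketch(text, vocab, idf):
    """Return a vocabulary-length vector of term count times IDF weight.

    Hypothetical helper; `vocab` maps token -> index and `idf` is a list
    of weights aligned with those indices.
    """
    counts = Counter(text.lower().split())
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        idx = vocab.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vec[idx] = count * idf[idx]
    return vec


vocab = {"enterprise": 0, "refunds": 1, "ticket": 2}
idf = [1.5, 1.0, 1.5]  # made-up weights for illustration
print(tfidf_embed_sketch("Refunds refunds ticket invoice", vocab, idf))
# [0.0, 2.0, 1.5]
```

The query is embedded with the same vocabulary and IDF weights as the chunks, which is what makes the two vectors comparable.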

Stage 3: Retrieval and Reranking

The retrieval phase combines vector similarity search with a lightweight reranking mechanism to select the most relevant context for answer generation.

When a query is submitted via RAGPipeline.query, the system first embeds the query using the same TF-IDF pipeline (tfidf_embed). It then calculates cosine similarity between the query vector and every stored chunk embedding in the search method. Results are sorted by similarity score, and the top-k chunks are retrieved based on the top_k parameter.
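The brute-force search described above might look like the sketch below. The names cosine_similarity and search_sketch, and the (index, score) return shape, are assumptions for illustration rather than the repository's actual search method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_sketch(query_vec, embeddings, top_k=5):
    """Score every stored embedding and return the top_k (index, score) pairs."""
    scored = [(cosine_similarity(query_vec, emb), i) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The first chunk matches the query direction exactly, the third not at all
print(search_sketch([1.0, 0.0, 0.0], embeddings, top_k=2))
```

Every query touches every stored vector, so the cost grows linearly with the number of chunks.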

Lightweight Reranking via Word Overlap

The reranking occurs within the simple_generate function. Instead of using a large language model to synthesize answers, this educational implementation acts as a reranker by selecting the sentence from the retrieved chunks that exhibits the highest word-overlap with the original question. Retrieved chunks are first concatenated into a formatted prompt via build_rag_prompt, then the generator scans for the best-matching sentence.

result = pipeline.query("What is the refund policy for enterprise customers?")
print(result["answer"])      # Sentence with highest word-overlap
print(result["retrieved"])   # Top-k chunks with similarity scores
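The word-overlap selection can be approximated with the sketch below. The function name select_best_sentence, the regex-based sentence splitting, and the tie-breaking rule (first best sentence wins) are assumptions; the repository's simple_generate works from the prompt built by build_rag_prompt and may split and score sentences differently.

```python
import re


def select_best_sentence(question, chunks):
    """Return the sentence sharing the most distinct words with the question.

    Hypothetical helper for illustration only.
    """
    q_words = set(re.findall(r"\w+", question.lower()))
    best_sentence, best_score = "", -1
    for chunk in chunks:
        # Split after ., ! or ? followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
            if not sentence:
                continue
            s_words = set(re.findall(r"\w+", sentence.lower()))
            score = len(q_words & s_words)
            if score > best_score:
                best_sentence, best_score = sentence, score
    return best_sentence


chunks = [
    "Standard accounts get partial credits. "
    "Enterprise customers get full refunds within 30 days.",
    "All refunds require a support ticket.",
]
question = "What is the refund policy for enterprise customers?"
print(select_best_sentence(question, chunks))
# Enterprise customers get full refunds within 30 days.
```

Matching is purely lexical: "refund" in the question does not match "refunds" in the text, so the winning sentence is chosen on "enterprise" and "customers" alone.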

Complete Implementation Example

The following example demonstrates the full pipeline from document ingestion to answer retrieval:

# Assumes this runs from phases/11-llm-engineering/06-rag/code/, next to main.py
from main import RAGPipeline

# Initialize the pipeline with chunking parameters
pipeline = RAGPipeline(chunk_size=200, overlap=50, top_k=5)

# Index your documents
documents = [
    "Enterprise customers qualify for full refunds within 30 days...",
    "Standard accounts are eligible for partial credits...",
    "All refunds require a support ticket submission..."
]
pipeline.index(documents, source_names=["enterprise.md", "standard.md", "support.md"])

# Query with automatic reranking
result = pipeline.query("What is the refund policy for enterprise customers?")
print(result["prompt"])   # Inspect the constructed context
print(result["answer"])   # View the selected sentence

From-Scratch Design Philosophy

This implementation deliberately avoids dependencies like scikit-learn or sentence-transformers to expose the underlying mathematics of RAG. According to the repository's AGENTS.md philosophy, the "Build-It" approach ensures practitioners understand that chunking prevents context window overflow, TF-IDF embedding creates sparse-but-interpretable retrieval keys, and similarity-based reranking filters noise before generation. In production environments, the TF-IDF step would typically be replaced with dense neural embeddings (e.g., OpenAI's text-embedding-3-small), and the word-overlap generator would be substituted with an LLM API call, but the architectural stages remain identical.
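One way to picture "same stages, different embedder" is a retriever that takes its embedding function as a parameter. The sketch below is hypothetical throughout: the Retriever class, the embed_fn parameter, and toy_embed are invented for illustration and are not part of the repository, whose RAGPipeline may not expose such a hook.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class Retriever:
    """Hypothetical retriever whose embedding step is swappable."""

    def __init__(self, embed_fn: Callable[[str], Vector], top_k: int = 5):
        self.embed_fn = embed_fn  # TF-IDF today, a neural embedder tomorrow
        self.top_k = top_k
        self.chunks: List[str] = []
        self.embeddings: List[Vector] = []

    def index(self, chunks: Sequence[str]) -> None:
        self.chunks = list(chunks)
        self.embeddings = [self.embed_fn(c) for c in self.chunks]

    def search(self, query: str) -> List[Tuple[str, float]]:
        q = self.embed_fn(query)
        scored = [(_cosine(q, e), c) for e, c in zip(self.embeddings, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(c, s) for s, c in scored[: self.top_k]]


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder: counts of three fixed keywords (illustration only)."""
    keywords = ("refund", "ticket", "credit")
    return [float(text.lower().count(k)) for k in keywords]


retriever = Retriever(embed_fn=toy_embed, top_k=1)
retriever.index([
    "Enterprise refund within 30 days",
    "Open a support ticket",
    "Partial credit for standard accounts",
])
print(retriever.search("refund policy"))
```

Replacing toy_embed with a function that calls an embedding model would leave index and search untouched, which is the sense in which the stages stay the same.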

Summary

  • Chunking with overlap in chunk_text preserves context across boundaries by sliding a fixed-size window with configurable overlap.
  • TF-IDF indexing via build_vocabulary, compute_idf, and tfidf_embed creates searchable vector representations stored in RAGPipeline.embeddings.
  • Cosine similarity search in the search method retrieves top-k chunks by comparing query and document vectors.
  • Lightweight reranking occurs in simple_generate, which selects the answer sentence with maximum word-overlap rather than generating new text.
  • The entire pipeline lives in phases/11-llm-engineering/06-rag/code/main.py and uses only Python standard library components.

Frequently Asked Questions

How does the chunking strategy prevent loss of context at document boundaries?

The chunk_text function uses a sliding window approach where consecutive chunks share an overlap region defined by the overlap parameter. When tokenized text is split into fixed-size windows, a phrase that spans a boundary and is no longer than the overlap appears whole in at least one of the adjacent chunks, so a query matching that phrase can still retrieve it intact. Passages longer than the overlap can still be split across chunks.

What type of embedding model does this RAG implementation use?

The repository implements a from-scratch TF-IDF (Term Frequency-Inverse Document Frequency) embedding system rather than neural embeddings. The tfidf_embed function combines term frequencies with pre-calculated IDF weights to generate sparse vectors. This design choice prioritizes algorithmic transparency over performance, though the architecture supports swapping in dense neural embeddings without changing the retrieval logic.

Where does the reranking happen in the pipeline?

Reranking occurs in the generation phase within the simple_generate function. After retrieving top-k chunks via cosine similarity, the system builds a prompt containing the concatenated context. The generator then acts as a reranker by evaluating every sentence in the retrieved text and selecting the one with the highest word-overlap with the query, effectively performing a secondary lexical ranking before returning the final answer.

Can this implementation scale to large document collections?

The current implementation stores all embeddings in memory (self.embeddings in the RAGPipeline class) and performs brute-force cosine similarity calculations across the entire corpus, so query time grows linearly with the number of chunks. While suitable for educational purposes and small-to-medium datasets, production deployments would require replacing the in-memory list with a vector index or database (e.g., FAISS, Chroma) and the TF-IDF vectors with compressed neural embeddings to handle millions of chunks efficiently.
