# How RAG Is Implemented with Chunking and Reranking in AI-Engineering-From-Scratch

> Learn how RAG is implemented with token chunking, TF-IDF indexing, and cosine similarity search, and how the repository's three-stage retrieval architecture and reranking method fit together.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-06-10

---

**The RAG pipeline in `rohitg00/ai-engineering-from-scratch` implements a three-stage retrieval architecture: token-based chunking with overlap, TF-IDF vector indexing, and cosine-similarity search followed by a lightweight word-overlap reranker that selects the answer sentence.**

This educational implementation in the AI-Engineering-From-Scratch curriculum demonstrates how Retrieval-Augmented Generation works under the hood without external neural libraries. The entire pipeline resides in [`phases/11-llm-engineering/06-rag/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/06-rag/code/main.py) and uses only Python standard library components to illustrate the core mechanics of chunking, embedding, and reranking.

## The Three-Stage RAG Architecture

The repository follows a classic retrieval flow designed for clarity and algorithmic transparency:

1. **Chunking** – Documents are segmented into overlapping windows to preserve boundary context.
2. **Embedding & Indexing** – Chunks are converted to TF-IDF vectors and stored in `self.embeddings`.
3. **Retrieval & Reranking** – Queries are matched via cosine similarity, then a word-overlap selector reranks sentences to produce the final answer.

## Stage 1: Document Chunking with Overlap

Effective RAG requires breaking large documents into searchable units while ensuring context that spans boundaries remains intact. The implementation achieves this through whitespace tokenization and sliding windows.

### The `chunk_text` Function Implementation

Located at lines 5-14 in [`phases/11-llm-engineering/06-rag/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/06-rag/code/main.py), the `chunk_text` function tokenizes input strings on whitespace and generates fixed-size chunks with configurable overlap:

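```python
# `rag` is assumed to be the lesson's main.py made importable under that name.
from rag import chunk_text

text = "Your long document text here spanning multiple sentences..."
# 200-token windows; each window repeats the last 50 tokens of the previous one
chunks = chunk_text(text, chunk_size=200, overlap=50)
```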

The function accepts two critical parameters:
- **`chunk_size`** – The number of tokens per chunk.
- **`overlap`** – The number of tokens shared between consecutive chunks to prevent context loss at boundaries.

This approach means that a phrase straddling a chunk boundary, as long as it is no longer than roughly `overlap` tokens, still appears whole in the following chunk, which improves recall during retrieval.
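The sliding window can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name and parameters match the ones described above, but the body, the `ValueError` guard, and the early `break` are assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into whitespace-token windows that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Tiny parameters so the overlap is easy to see
demo = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=2)
for chunk in demo:
    print(chunk)
```

With `chunk_size=4` and `overlap=2` the window advances two tokens at a time, so each chunk repeats the last two tokens of the previous one.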

## Stage 2: TF-IDF Embedding and Indexing

Rather than relying on pre-trained neural embeddings, this from-scratch implementation uses classical TF-IDF (Term Frequency-Inverse Document Frequency) to create searchable vector representations.

### Building the Vocabulary and Computing IDF

The pipeline constructs a global vocabulary via `build_vocabulary`, which gathers unique lower-cased tokens from all chunks across the document corpus. It then computes inverse-document-frequency weights using `compute_idf`, establishing the statistical importance of each term.
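A rough sketch of these two steps follows. The function names come from the description above, but the signatures, the `tokenize` helper, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions made for illustration; the repository may compute IDF differently.

```python
import math


def tokenize(text):
    # Hypothetical helper: lower-case, then split on whitespace
    return text.lower().split()


def build_vocabulary(chunks):
    """Map each unique lower-cased token to a column index."""
    vocab = {}
    for chunk in chunks:
        for token in tokenize(chunk):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def compute_idf(chunks, vocab):
    """Return one smoothed IDF weight per vocabulary column."""
    n_chunks = len(chunks)
    doc_freq = [0] * len(vocab)
    for chunk in chunks:
        for token in set(tokenize(chunk)):  # count each token once per chunk
            doc_freq[vocab[token]] += 1
    return [math.log((1 + n_chunks) / (1 + df)) + 1.0 for df in doc_freq]


chunks = ["refunds need a ticket", "enterprise refunds are full"]
vocab = build_vocabulary(chunks)
idf = compute_idf(chunks, vocab)
```

In this toy corpus "refunds" occurs in both chunks and gets the lowest weight, while a term such as "ticket" that occurs in only one chunk is weighted higher.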

### Generating Chunk Embeddings

Each chunk is transformed into a vocabulary-length vector using `tfidf_embed`, which multiplies term-frequency counts by the pre-computed IDF vector. Because a chunk typically contains only a small part of the vocabulary, most entries are zero. The resulting embeddings are stored in the `self.embeddings` attribute of the `RAGPipeline` class, creating an in-memory index ready for similarity search.

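```python
# `rag` is assumed to be the lesson's main.py made importable under that name.
from rag import RAGPipeline

# Placeholder documents for illustration
doc1, doc2, doc3 = "First document text...", "Second document text...", "Third document text..."

pipeline = RAGPipeline(chunk_size=200, overlap=50, top_k=5)
pipeline.index(
    documents=[doc1, doc2, doc3],
    source_names=["doc1.md", "doc2.md", "doc3.md"]
)
```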
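The embedding step that `index` applies to each chunk might look like the sketch below. The name `tfidf_embed` comes from the article, but the signature, the use of raw counts as term frequency, and the hand-written `vocab` and `idf` values are assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def tfidf_embed(text, vocab, idf):
    """Return a vocabulary-length list: raw term count times IDF per column."""
    counts = Counter(text.lower().split())
    vector = [0.0] * len(vocab)
    for token, count in counts.items():
        column = vocab.get(token)
        if column is not None:  # tokens outside the vocabulary are ignored
            vector[column] = count * idf[column]
    return vector


# Hand-written vocabulary and IDF weights, for illustration only
vocab = {"refunds": 0, "enterprise": 1, "ticket": 2}
idf = [1.0, 1.5, 1.5]
vec = tfidf_embed("Enterprise refunds refunds unknown", vocab, idf)
print(vec)  # [2.0, 1.5, 0.0]
```

Whether term frequency is a raw count or is normalized by chunk length does not change the direction of the vector, so cosine similarity gives the same ranking either way.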

## Stage 3: Retrieval and Reranking

The retrieval phase combines vector similarity search with a lightweight reranking mechanism to select the most relevant context for answer generation.

### Cosine Similarity Search

When a query is submitted via `RAGPipeline.query`, the system first embeds the query using the same TF-IDF pipeline (`tfidf_embed`). The `search` method then calculates cosine similarity between the query vector and every stored chunk embedding. Results are sorted by similarity score in descending order, and the `top_k` parameter determines how many chunks are returned.
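A brute-force version of this search can be sketched as follows. In the repository `search` is described as a method of `RAGPipeline`; here it is written as a standalone function, and its signature, the zero-vector guard, and the `(index, score)` return shape are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)


def search(query_vec, embeddings, top_k=5):
    """Score every stored vector and return the best top_k as (index, score)."""
    scored = [(i, cosine_similarity(query_vec, emb)) for i, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
hits = search([2.0, 0.0, 0.0], embeddings, top_k=2)
```

Here the first stored vector points in the same direction as the query and scores 1.0, the second shares one of its two terms and scores about 0.71, and the third shares none and is cut off by `top_k=2`.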

### Lightweight Reranking via Word Overlap

Reranking happens inside the `simple_generate` function. Retrieved chunks are first concatenated into a formatted prompt via `build_rag_prompt`. Instead of calling a large language model to synthesize an answer, `simple_generate` then acts as a reranker: it scans the retrieved text and returns the sentence with the highest word overlap with the original question.

```python
result = pipeline.query("What is the refund policy for enterprise customers?")
print(result["answer"])      # Sentence with highest word-overlap
print(result["retrieved"])   # Top-k chunks with similarity scores
```
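The word-overlap selection can be approximated with the sketch below. `select_best_sentence` is a hypothetical name, and the regex-based sentence splitting and tokenization are assumptions; the repository's `simple_generate` may split and score sentences differently.

```python
import re


def select_best_sentence(question, retrieved_chunks):
    """Return the sentence sharing the most distinct words with the question."""
    question_words = set(re.findall(r"\w+", question.lower()))
    best_sentence, best_overlap = "", -1
    for chunk in retrieved_chunks:
        # Naive sentence split: break after ., ! or ? followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
            words = set(re.findall(r"\w+", sentence.lower()))
            overlap = len(question_words & words)
            if overlap > best_overlap:
                best_sentence, best_overlap = sentence, overlap
    return best_sentence


retrieved = [
    "Standard accounts are eligible for partial credits. All refunds require a ticket.",
    "Enterprise customers qualify for full refunds within 30 days.",
]
answer = select_best_sentence(
    "What is the refund policy for enterprise customers?", retrieved
)
print(answer)
```

Exact word matching is brittle: "refund" in the question does not match "refunds" in the text, so the winning sentence is chosen on "enterprise", "customers", and "for" alone.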

## Complete Implementation Example

The following example demonstrates the full pipeline from document ingestion to answer retrieval:

```python
# `rag` is assumed to be the lesson's main.py made importable under that name.
from rag import RAGPipeline

# Initialize the pipeline with chunking parameters
pipeline = RAGPipeline(chunk_size=200, overlap=50, top_k=5)

# Index your documents
documents = [
    "Enterprise customers qualify for full refunds within 30 days...",
    "Standard accounts are eligible for partial credits...",
    "All refunds require a support ticket submission..."
]
pipeline.index(documents, source_names=["enterprise.md", "standard.md", "support.md"])

# Query with automatic reranking
result = pipeline.query("What is the refund policy for enterprise customers?")
print(result["prompt"])   # Inspect the constructed context
print(result["answer"])   # View the selected sentence
```

## From-Scratch Design Philosophy

This implementation deliberately avoids dependencies like `scikit-learn` or `sentence-transformers` to expose the underlying mathematics of RAG. According to the repository's [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md) philosophy, the "Build-It" approach ensures practitioners understand that **chunking** prevents context window overflow, **TF-IDF embedding** creates sparse-but-interpretable retrieval keys, and **similarity-based reranking** filters noise before generation. In production environments, the TF-IDF step would typically be replaced with dense neural embeddings (e.g., OpenAI's `text-embedding-3-small`), and the word-overlap generator would be substituted with an LLM API call, but the architectural stages remain identical.

## Summary

- **Chunking with overlap** in `chunk_text` preserves context across boundaries by sliding a fixed-size window with configurable overlap.
- **TF-IDF indexing** via `build_vocabulary`, `compute_idf`, and `tfidf_embed` creates searchable vector representations stored in `RAGPipeline.embeddings`.
- **Cosine similarity search** in the `search` method retrieves top-k chunks by comparing query and document vectors.
- **Lightweight reranking** occurs in `simple_generate`, which selects the answer sentence with maximum word-overlap rather than generating new text.
- The entire pipeline lives in [`phases/11-llm-engineering/06-rag/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/06-rag/code/main.py) and uses only Python standard library components.

## Frequently Asked Questions

### How does the chunking strategy prevent loss of context at document boundaries?

The `chunk_text` function uses a sliding window approach where consecutive chunks share an overlap region defined by the `overlap` parameter. Each new window starts `chunk_size - overlap` tokens after the previous one, so a phrase that straddles a boundary and is no longer than roughly `overlap` tokens still appears whole in the next chunk. Queries that match that phrase can therefore still retrieve it with its surrounding context.

### What type of embedding model does this RAG implementation use?

The repository implements a from-scratch **TF-IDF** (Term Frequency-Inverse Document Frequency) embedding system rather than neural embeddings. The `tfidf_embed` function combines term frequencies with pre-calculated IDF weights to generate sparse vectors. This design choice prioritizes algorithmic transparency over performance, though the architecture supports swapping in dense neural embeddings without changing the retrieval logic.

### Where does the reranking happen in the pipeline?

Reranking occurs in the **generation phase** within the `simple_generate` function. After retrieving top-k chunks via cosine similarity, the system builds a prompt containing the concatenated context. The generator then acts as a reranker by evaluating every sentence in the retrieved text and selecting the one with the highest word-overlap with the query, effectively performing a secondary lexical ranking before returning the final answer.

### Can this implementation scale to large document collections?

The current implementation stores all embeddings in memory (`self.embeddings` in the `RAGPipeline` class) and performs brute-force cosine similarity calculations across the entire corpus, so query time grows linearly with the number of chunks. While suitable for educational purposes and small-to-medium datasets, production deployments would require replacing the in-memory list with a vector index or database (e.g., FAISS, Chroma) and the TF-IDF vectors with dense neural embeddings to handle millions of chunks efficiently.