Data Preprocessing Techniques in AI Engineering from Scratch: Complete Implementation Guide

The rohitg00/ai-engineering-from-scratch repository covers data preprocessing across three domains: tokenization, stemming, and lemmatization for NLP; image normalization for computer vision; and text cleaning with MinHash-LSH deduplication for large language model training.

The curriculum provides hands-on implementations that turn raw data into model-ready tensors. They range from text cleaning pipelines that strip HTML and remove near-duplicates to vision preprocessing that accepts both NumPy arrays and PyTorch tensors.

NLP Data Preprocessing: Tokenization, Stemming, and Lemmatization

The repository implements classical NLP preprocessing in phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py. The preprocess function orchestrates a three-stage pipeline that converts raw text into structured linguistic features.

Tokenization extracts words and punctuation via regex patterns. The tokenize function splits an input string into word and punctuation tokens, so a trailing period becomes its own token while an alphanumeric run such as 3pm stays intact.
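As a rough illustration of regex tokenization, here is a minimal sketch. The pattern and the name tokenize_sketch are assumptions made for this example; the repository's tokenize function may use a different expression.

```python
import re

# Hypothetical pattern (not taken from the repository): a run of word
# characters, or any single character that is neither word nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_sketch(text: str) -> list[str]:
    """Split text into word and punctuation tokens, keeping original case."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_sketch("The cats were running at 3pm."))
# ['The', 'cats', 'were', 'running', 'at', '3pm', '.']
```

Because \w matches both letters and digits, 3pm is kept as one token rather than split into a number and a suffix.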

Stemming applies rule-based suffix stripping through the stem_step_1a function. In the sample output below, this reduces cats to cat while leaving running and were unchanged, so a stem is not always a dictionary word or a full root form.
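The name stem_step_1a suggests step 1a of the classic Porter stemmer, which handles plural suffixes. That reading is an assumption. The sketch below implements the standard Porter step 1a rules under a hypothetical name and may not match the repository's code line for line.

```python
def stem_step_1a_sketch(word: str) -> str:
    """Apply the Porter step 1a plural rules to a single lowercased word.

    Rules are tried longest suffix first:
      sses -> ss, ies -> i, ss -> ss, s -> (removed)
    """
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats", "running"]:
    print(w, "->", stem_step_1a_sketch(w))
```

Note that this step alone leaves running untouched, which is consistent with a stemmer that only strips plural endings.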

Lemmatization utilizes the lemmatize function, which references a small lemma table or falls back to rule-based suffix stripping to produce dictionary-standard forms. The pipeline optionally integrates a POS tagger via the demo_pos_tagger parameter.
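The table-then-fallback idea can be sketched as follows. The table entries, suffix rules, and the name lemmatize_sketch are illustrative assumptions, not the repository's actual data or logic, and the rules are deliberately crude.

```python
# Hypothetical lemma table covering a few irregular forms.
LEMMA_TABLE = {
    "were": "be", "was": "be", "is": "be", "are": "be",
    "ran": "run", "mice": "mouse",
}


def lemmatize_sketch(token: str) -> str:
    """Look the token up in a small table, else fall back to suffix rules."""
    word = token.lower()
    if word in LEMMA_TABLE:
        return LEMMA_TABLE[word]
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling: "runn" -> "run" (but keep "fall", "pass").
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # cities -> city
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # cats -> cat
    return word


tokens = ["The", "cats", "were", "running", "at", "3pm", "."]
print([lemmatize_sketch(t) for t in tokens])
# ['the', 'cat', 'be', 'run', 'at', '3pm', '.']
```

Simple suffix rules like these will mislemmatize many words (for example, short words ending in s), which is why a lookup table or a POS tagger is usually consulted first.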


# File: phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py
# Run from that file's directory so that `main` resolves to it.

from main import preprocess, demo_pos_tagger

text = "The cats were running at 3pm."
result = preprocess(text, pos_tagger=demo_pos_tagger)

print("Tokens →", result["tokens"])
print("Stems  →", result["stems"])
print("Lemmas →", result["lemmas"])

Running this script produces:


Tokens → ['The', 'cats', 'were', 'running', 'at', '3pm', '.']
Stems  → ['the', 'cat', 'were', 'running', 'at', '3pm', '.']
Lemmas → ['the', 'cat', 'be', 'run', 'at', '3pm', '.']

Computer Vision Preprocessing: Image Normalization and Tensor Conversion

For computer vision workflows, the repository provides the VisionPipeline class in phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py. The preprocess method handles heterogeneous inputs, validating dimensions and normalizing pixel values for neural network consumption.

The implementation automatically detects whether the input is a NumPy ndarray or PyTorch Tensor. It validates dimensionality requirements (HxWx3 for NumPy arrays, (3, H, W) for tensors) before applying normalization. Normalization scales raw uint8 pixel values to the [0, 1] floating-point range. The method then transfers the tensor to the target compute device (cpu or cuda).
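The NumPy branch of that flow can be approximated without PyTorch. This is a minimal sketch under assumptions: the function name preprocess_sketch is hypothetical, the output is a NumPy array rather than a tensor, and the device transfer step is omitted.

```python
import numpy as np


def preprocess_sketch(image: np.ndarray) -> np.ndarray:
    """Validate an HxWx3 image and return a (3, H, W) float32 array in [0, 1]."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"expected an HxWx3 image, got shape {image.shape}")
    # Scale uint8 pixel values from [0, 255] to [0, 1].
    scaled = image.astype(np.float32) / 255.0
    # Move channels first: (H, W, 3) -> (3, H, W).
    return np.ascontiguousarray(scaled.transpose(2, 0, 1))


img = (np.random.rand(400, 600, 3) * 255).astype(np.uint8)
out = preprocess_sketch(img)
print(out.shape, out.dtype)  # (3, 400, 600) float32
```

In a PyTorch version, the final array would typically be wrapped with torch.from_numpy and moved with .to(device).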


# File: phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py
# Run from that file's directory so that `main` resolves to it.

import numpy as np
from main import VisionPipeline, StubDetector, StubClassifier

# Create a dummy RGB image (HxWx3)
img = (np.random.rand(400, 600, 3) * 255).astype(np.uint8)

detector = StubDetector()
classifier = StubClassifier(num_classes=10)
pipeline = VisionPipeline(detector, classifier, [f"class_{i}" for i in range(10)])

tensor = pipeline.preprocess(img)      # → torch.FloatTensor on CPU, values ∈[0, 1]

print(tuple(tensor.shape))             # (3, 400, 600)

Additional image preprocessing examples appear in phases/04-computer-vision/01-image-fundamentals/code/main.py, which includes the preprocess_imagenet function for ImageNet-style normalization.
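ImageNet-style normalization usually means scaling to [0, 1] and then subtracting a per-channel mean and dividing by a per-channel standard deviation. The sketch below uses the widely published ImageNet RGB statistics; whether preprocess_imagenet uses exactly these values is an assumption, and the function name here is hypothetical.

```python
import numpy as np

# Commonly used ImageNet RGB statistics (assumed, not read from the repo).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def imagenet_normalize_sketch(image: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 image to a standardized (3, H, W) float32 array."""
    scaled = image.astype(np.float32) / 255.0
    # Broadcasting applies each channel's mean and std along the last axis.
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)


black = np.zeros((4, 5, 3), dtype=np.uint8)
result = imagenet_normalize_sketch(black)
print(result.shape)
```

After this step values are no longer confined to [0, 1]; a black pixel maps to roughly -2.1 in the red channel.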

LLM Data Pipeline: Production-Grade Text Preprocessing

The most comprehensive preprocessing implementation resides in phases/10-llms-from-scratch/03-data-pipelines/code/main.py, which mirrors production data ingestion pipelines for large language model training. This six-stage pipeline processes raw documents into batched training tensors.

Stage 1: HTML Cleaning and Text Normalization

The clean_text function strips HTML tags, removes URLs, and collapses whitespace to produce plain text suitable for tokenization.
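A minimal sketch of those three steps, using only the standard library. The regular expressions and the name clean_text_sketch are assumptions for illustration; the repository's clean_text may handle more cases.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # anything shaped like an HTML tag
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http(s) or www-prefixed URLs
WS_RE = re.compile(r"\s+")                      # any run of whitespace


def clean_text_sketch(raw: str) -> str:
    """Strip tags, remove URLs, and collapse whitespace into single spaces."""
    text = TAG_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


print(clean_text_sketch("<p>Read   more at https://example.com/docs today.</p>"))
# Read more at today.
```

Regex-based tag stripping is fine for a teaching pipeline, but malformed markup or embedded scripts generally call for a real HTML parser.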

Stage 2: Quality Filtering

The quality_filter function applies heuristic gates to reject documents that are too short, contain excessive capitalization, or are dominated by special characters. This ensures training data meets minimum coherence standards.
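The heuristics described above might look like the sketch below. All thresholds, parameter names, and the function name are hypothetical; the repository's quality_filter may use different gates and cutoffs.

```python
def quality_filter_sketch(
    text: str,
    min_words: int = 20,
    max_upper_ratio: float = 0.3,
    max_special_ratio: float = 0.3,
) -> bool:
    """Return True if the document passes all heuristic gates."""
    # Gate 1: reject documents that are too short.
    if len(text.split()) < min_words:
        return False
    # Gate 2: reject documents with no letters or mostly uppercase letters.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    if sum(c.isupper() for c in letters) / len(letters) > max_upper_ratio:
        return False
    # Gate 3: reject documents dominated by non-alphanumeric characters.
    non_space = [c for c in text if not c.isspace()]
    special_ratio = sum(not c.isalnum() for c in non_space) / len(non_space)
    return special_ratio <= max_special_ratio


good = "the quick brown fox jumps over the lazy dog " * 3
print(quality_filter_sketch(good))          # True
print(quality_filter_sketch(good.upper()))  # False
```

Each gate returns early, so the cheapest check (length) runs first and the character scans only happen for documents that are long enough.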

Stage 3: Deduplication with MinHash-LSH

The deduplicate function implements MinHash-LSH (Locality-Sensitive Hashing) to identify and remove near-duplicate documents. It computes shingle sets, builds MinHash signatures, buckets them via LSH, and drops duplicates using Jaccard similarity thresholds.
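The four steps (shingling, signatures, LSH banding, Jaccard verification) can be sketched with the standard library alone. Everything here is an assumption for illustration: the function names, character 5-gram shingles, 64 hashes split into 16 bands, and the 0.8 threshold are not taken from the repository.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family, built on salted BLAKE2b."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set[str], num_hashes: int = 64) -> tuple[int, ...]:
    """For each hash function, keep the minimum hash over all shingles."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes))


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def deduplicate_sketch(docs, num_hashes=64, bands=16, threshold=0.8):
    """Keep the first document of each near-duplicate group."""
    rows = num_hashes // bands
    buckets = {}                    # (band index, band values) -> kept doc indices
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        sig = minhash_signature(sh, num_hashes)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # Candidates are kept docs sharing at least one identical band.
        candidates = {i for key in keys for i in buckets.get(key, ())}
        # Verify candidates with exact Jaccard similarity before dropping.
        if any(jaccard(sh, kept_shingles[i]) >= threshold for i in candidates):
            continue
        index = len(kept)
        kept.append(doc)
        kept_shingles.append(sh)
        for key in keys:
            buckets.setdefault(key, []).append(index)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about tokenizers.",
]
unique_docs = deduplicate_sketch(docs)
print(len(unique_docs))
```

Banding is what avoids comparing every pair: only documents that agree on all rows of some band are ever compared exactly. More bands with fewer rows catch lower-similarity pairs at the cost of more candidate checks.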

Stages 4-6: Tokenization, Packing, and Batching

The SimpleTokenizer trains a byte-pair-encoding (BPE) tokenizer on the cleaned corpus. The pack_sequences function pads or truncates token streams into fixed-length sequences with attention masks. Finally, PreTrainingDataLoader shuffles and yields mini-batches for training.
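The packing and batching stages can be illustrated with plain Python lists. This sketch assumes a pad id of 0 and a seeded shuffle; the names pack_sequences_sketch and iter_batches_sketch are hypothetical stand-ins, and the repository's pack_sequences and PreTrainingDataLoader may return tensors and behave differently.

```python
import random


def pack_sequences_sketch(token_streams, seq_len, pad_id=0):
    """Truncate or right-pad each stream to seq_len and build attention masks."""
    packed = []
    for tokens in token_streams:
        ids = list(tokens[:seq_len])          # truncate long streams
        mask = [1] * len(ids)                 # 1 marks a real token
        padding = seq_len - len(ids)
        packed.append((ids + [pad_id] * padding, mask + [0] * padding))
    return packed


def iter_batches_sketch(examples, batch_size, seed=0):
    """Shuffle example order once, then yield consecutive mini-batches."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


packed = pack_sequences_sketch([[5, 6, 7], [8, 9, 10, 11, 12, 13]], seq_len=4)
print(packed)
# [([5, 6, 7, 0], [1, 1, 1, 0]), ([8, 9, 10, 11], [1, 1, 1, 1])]

batches = list(iter_batches_sketch(packed * 3, batch_size=4))
print([len(b) for b in batches])  # [4, 2]
```

The attention mask lets a model ignore padded positions, so short documents do not contribute spurious tokens to the loss.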


# File: phases/10-llms-from-scratch/03-data-pipelines/code/main.py

from main import run_pipeline

if __name__ == "__main__":
    run_pipeline()

Execution produces step-by-step statistics:


Stage 1: Cleaning → 35 documents
Stage 2: Quality Filtering → removed 4 low-quality docs
Stage 3: Deduplication → removed 3 near-duplicates
Stage 4: Tokenizer trained with 256 tokens
...
Dataset Statistics → total_documents: 28, total_tokens: 1,234,567, vocab_utilization: 45%

Key Implementation Files

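The repository organizes preprocessing utilities across domain-specific modules:

phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py: tokenize, stem_step_1a, lemmatize, and the preprocess wrapper for classical NLP.

phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py: VisionPipeline.preprocess for image validation, normalization, and device transfer.

phases/04-computer-vision/01-image-fundamentals/code/main.py: preprocess_imagenet for ImageNet-style normalization.

phases/10-llms-from-scratch/03-data-pipelines/code/main.py: clean_text, quality_filter, deduplicate, SimpleTokenizer, pack_sequences, and PreTrainingDataLoader for LLM training data.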

Frequently Asked Questions

What specific text cleaning methods are implemented in the LLM data pipeline?

The LLM pipeline in phases/10-llms-from-scratch/03-data-pipelines/code/main.py implements clean_text for HTML stripping and URL removal, quality_filter for heuristic document rejection based on length and character distribution, and deduplicate for near-duplicate detection using MinHash signatures and Jaccard similarity thresholds.

How does the vision preprocessing handle different input formats?

The VisionPipeline.preprocess method in phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py detects whether the input is a NumPy ndarray or PyTorch Tensor, validates dimensions for RGB images (requiring HxWx3 or (3, H, W) formats), normalizes pixel values to [0, 1], and moves the data to the configured compute device.

Does the repository include stemming or lemmatization for NLP tasks?

Yes, the phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py file contains both stem_step_1a for rule-based stemming and lemmatize for dictionary-based or rule-based lemmatization, wrapped in a unified preprocess function that returns tokens, stems, and lemmas alongside optional POS tags.

Are the preprocessing utilities production-ready for large datasets?

Only partly. The LLM preprocessing pipeline uses techniques found in production systems, such as MinHash-LSH deduplication and memory-efficient batching via PreTrainingDataLoader, but the implementations in ai-engineering-from-scratch are designed for educational clarity and single-machine execution rather than distributed cluster processing.
