# Data Preprocessing Techniques in AI Engineering from Scratch: Complete Implementation Guide

> Implement AI data preprocessing techniques from scratch. Explore NLP tokenization, lemmatization, image normalization, and LLM text cleaning in this comprehensive guide.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-06

---

**The rohitg00/ai-engineering-from-scratch repository covers data preprocessing in three areas: tokenization, stemming, and lemmatization for NLP; image normalization for computer vision; and production-style text cleaning with MinHash-LSH deduplication for large language model training.**

The curriculum provides hands-on implementations that bridge the gap between raw data and model-ready tensors. From text cleaning pipelines that strip HTML and remove near-duplicates using MinHash-LSH, to vision preprocessing that handles NumPy arrays and PyTorch tensors, this repository demonstrates essential data preprocessing techniques across multiple AI domains.

## NLP Data Preprocessing: Tokenization, Stemming, and Lemmatization

The repository implements classical NLP preprocessing in [`phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py). The `preprocess` function orchestrates a three-stage pipeline that converts raw text into structured linguistic features.

**Tokenization** uses regex patterns to split the input string into word and punctuation tokens. In the sample run later in this section, `3pm` stays a single token and the trailing period becomes a token of its own.
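As a rough illustration of regex tokenization, the sketch below splits text into runs of word characters and single punctuation marks. The pattern and the name `tokenize_sketch` are assumptions made for this example; the repository's `tokenize` function may use a different expression.

```python
import re

# Hypothetical pattern (not taken from the repository):
# a run of word characters, or one non-space, non-word symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_sketch(text: str) -> list[str]:
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_sketch("The cats were running at 3pm.")
print(tokens)  # ['The', 'cats', 'were', 'running', 'at', '3pm', '.']
```

A pattern this simple splits contractions such as `don't` into three tokens, so a fuller tokenizer would need extra alternatives for apostrophes and hyphens.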

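**Stemming** applies rule-based suffix stripping through the `stem_step_1a` function. In the sample run later in this section, the stems are lowercased, `cats` becomes `cat`, and `running` is left unchanged, so this step handles plural suffixes rather than `-ing` endings.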
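The function name suggests step 1a of the classic Porter stemmer, although the source does not confirm this. Under that assumption, a minimal sketch of the plural-suffix rules looks like the following; `stem_step_1a_sketch` is a hypothetical name, and the lowercasing is included only to match the sample output.

```python
def stem_step_1a_sketch(word: str) -> str:
    """Apply the Porter step 1a plural rules to a single word."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word            # running -> running (no plural suffix)


for w in ["The", "cats", "were", "running", "caresses", "ponies"]:
    print(w, "->", stem_step_1a_sketch(w))
```

The rules are checked from the longest suffix to the shortest, so `caresses` is handled by the `sses` rule before the bare `s` rule can fire.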

**Lemmatization** uses the `lemmatize` function, which looks each token up in a small lemma table and falls back to rule-based suffix stripping when the token is missing, producing dictionary forms. The pipeline optionally accepts a POS tagger through the `pos_tagger` parameter; `demo_pos_tagger` is the sample tagger passed in the example below.
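The table-then-fallback idea can be sketched in a few lines. The table entries, the suffix rules, and the name `lemmatize_sketch` are all illustrative assumptions; the repository's table and rules may differ.

```python
# Hypothetical lemma table for irregular forms (not the repository's table).
LEMMA_TABLE = {
    "were": "be", "was": "be", "is": "be", "are": "be",
    "running": "run", "mice": "mouse",
}


def lemmatize_sketch(token: str) -> str:
    """Look the token up in the table, else strip a simple plural suffix."""
    word = token.lower()
    if word in LEMMA_TABLE:
        return LEMMA_TABLE[word]
    # Fallback rules, applied only to words long enough to be plurals.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # ponies -> pony
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # cats -> cat
    return word


tokens = ["The", "cats", "were", "running", "at", "3pm", "."]
print([lemmatize_sketch(t) for t in tokens])
# ['the', 'cat', 'be', 'run', 'at', '3pm', '.']
```

Unlike a stemmer, the table lets irregular forms such as `were` map to a real dictionary word (`be`) instead of a truncated string.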

```python
# File: phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py
# Run from the directory containing main.py so the import below resolves.
from main import preprocess, demo_pos_tagger

text = "The cats were running at 3pm."
# pos_tagger is optional; demo_pos_tagger is the sample tagger imported above.
result = preprocess(text, pos_tagger=demo_pos_tagger)

print("Tokens →", result["tokens"])
print("Stems  →", result["stems"])
print("Lemmas →", result["lemmas"])
```

Running this script produces:

```

Tokens → ['The', 'cats', 'were', 'running', 'at', '3pm', '.']
Stems  → ['the', 'cat', 'were', 'running', 'at', '3pm', '.']
Lemmas → ['the', 'cat', 'be', 'run', 'at', '3pm', '.']

```

## Computer Vision Preprocessing: Image Normalization and Tensor Conversion

For computer vision workflows, the repository provides the `VisionPipeline` class in [`phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py). The `preprocess` method handles heterogeneous inputs, validating dimensions and normalizing pixel values for neural network consumption.

The implementation automatically detects whether the input is a NumPy `ndarray` or PyTorch `Tensor`. It validates dimensionality requirements (`HxWx3` for NumPy arrays, `(3, H, W)` for tensors) before applying normalization. **Normalization** scales raw uint8 pixel values to the `[0, 1]` floating-point range. The method then transfers the tensor to the target compute device (`cpu` or `cuda`).
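The NumPy half of that flow can be sketched without PyTorch. The function below validates an `HxWx3` array, scales it to `[0, 1]`, and reorders it to channels-first. The name `to_chw_float` is hypothetical, and the real `preprocess` method returns a tensor on a compute device, which this sketch does not attempt.

```python
import numpy as np


def to_chw_float(image: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 image to a (3, H, W) float32 array in [0, 1]."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"expected an HxWx3 array, got shape {image.shape}")
    # uint8 values 0..255 become float32 values 0.0..1.0.
    scaled = image.astype(np.float32) / 255.0
    # Reorder axes from height-width-channel to channel-height-width.
    return np.ascontiguousarray(scaled.transpose(2, 0, 1))


rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
chw = to_chw_float(img)
print(chw.shape, chw.dtype)  # (3, 4, 6) float32
```

With PyTorch installed, the result could then be wrapped with `torch.from_numpy(chw)` and moved with `.to(device)`, which is the kind of device transfer the section describes.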

```python
# File: phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py
# Run from the directory containing main.py so the import below resolves.
import numpy as np
from main import VisionPipeline, StubDetector, StubClassifier

# Create a dummy RGB image (HxWx3, uint8)
img = (np.random.rand(400, 600, 3) * 255).astype(np.uint8)

detector = StubDetector()
classifier = StubClassifier(num_classes=10)
class_names = [f"class_{i}" for i in range(10)]
pipeline = VisionPipeline(detector, classifier, class_names)

tensor = pipeline.preprocess(img)      # → torch.FloatTensor on CPU, values in [0, 1]

print(tensor.shape)                    # (3, 400, 600)
```

Additional image preprocessing examples appear in [`phases/04-computer-vision/01-image-fundamentals/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/01-image-fundamentals/code/main.py), which includes the `preprocess_imagenet` function for ImageNet-style normalization.
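ImageNet-style normalization usually means scaling pixels to `[0, 1]` and then standardizing each channel with the widely used ImageNet mean and standard deviation. The sketch below shows only that arithmetic. Whether `preprocess_imagenet` also resizes or crops is not stated in this guide, so `imagenet_normalize` is a hypothetical stand-in rather than the repository's function.

```python
import numpy as np

# Commonly used per-channel ImageNet statistics (RGB order, for values in [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def imagenet_normalize(image: np.ndarray) -> np.ndarray:
    """Standardize an HxWx3 uint8 image channel by channel."""
    scaled = image.astype(np.float32) / 255.0
    # Broadcasting applies the 3-element mean and std along the last axis.
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


white = np.full((2, 2, 3), 255, dtype=np.uint8)
normalized = imagenet_normalize(white)
print(normalized.shape)  # (2, 2, 3)
```

After this step the values are no longer confined to `[0, 1]`: a pure white pixel maps to roughly 2.2 to 2.6 depending on the channel.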

## LLM Data Pipeline: Production-Grade Text Preprocessing

The most comprehensive preprocessing implementation resides in [`phases/10-llms-from-scratch/03-data-pipelines/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/03-data-pipelines/code/main.py), which mirrors production data ingestion pipelines for large language model training. This six-stage pipeline processes raw documents into batched training tensors.

### Stage 1: HTML Cleaning and Text Normalization

The `clean_text` function strips HTML tags, removes URLs, and collapses whitespace to produce plain text suitable for tokenization.
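A cleaning step of this kind can be approximated with three regular expressions. The patterns and the name `clean_text_sketch` are assumptions for illustration; the repository's `clean_text` may handle more cases, such as HTML entities.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare URLs in the text
WS_RE = re.compile(r"\s+")                    # runs of spaces, tabs, and newlines


def clean_text_sketch(raw: str) -> str:
    """Strip tags, drop URLs, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)    # tags first, so URLs inside attributes vanish too
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


print(clean_text_sketch("<p>Read   more at https://example.com/docs today.</p>"))
# Read more at today.
```

Replacing matches with a space rather than an empty string keeps words on either side of a tag from being glued together; the final whitespace pass removes the extra spaces.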

### Stage 2: Quality Filtering

The `quality_filter` function applies heuristic gates to reject documents that are too short, contain excessive capitalization, or are dominated by special characters. This ensures training data meets minimum coherence standards.
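The three gates described above can be sketched as simple ratio checks. Every threshold and the name `quality_filter_sketch` are hypothetical; the repository's function may use different cutoffs or additional signals.

```python
def quality_filter_sketch(
    text: str,
    min_words: int = 5,               # hypothetical threshold
    max_upper_ratio: float = 0.5,     # hypothetical threshold
    max_special_ratio: float = 0.3,   # hypothetical threshold
) -> bool:
    """Return True when the document passes all three heuristic gates."""
    # Gate 1: too short.
    if len(text.split()) < min_words:
        return False
    # Gate 2: excessive capitalization among alphabetic characters.
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > max_upper_ratio:
        return False
    # Gate 3: dominated by characters that are neither letters nor digits.
    non_space = [c for c in text if not c.isspace()]
    special = sum(not c.isalnum() for c in non_space)
    if non_space and special / len(non_space) > max_special_ratio:
        return False
    return True


print(quality_filter_sketch("The quick brown fox jumps over the lazy dog."))  # True
print(quality_filter_sketch("BUY NOW CHEAP PILLS CLICK HERE TODAY"))          # False
```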

### Stage 3: Deduplication with MinHash-LSH

The `deduplicate` function implements **MinHash-LSH** (Locality-Sensitive Hashing) to identify and remove near-duplicate documents. It computes shingle sets, builds MinHash signatures, buckets them via LSH, and drops duplicates using **Jaccard similarity** thresholds.
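The four steps named above (shingling, MinHash signatures, LSH bucketing, Jaccard verification) can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's code: word 3-gram shingles, SHA-256 with a seed prefix as the hash family, 64 hashes in 32 bands, and a 0.7 threshold are all choices made for this example, as are the function names.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed: int, shingle: str) -> int:
    """One member of the hash family: SHA-256 over 'seed:shingle'."""
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set: set[str], num_hashes: int) -> list[int]:
    """For each hash function keep the minimum value over the shingle set."""
    return [min(seeded_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate_sketch(docs, threshold=0.7, num_hashes=64, bands=32):
    """Keep the first document of every near-duplicate group."""
    rows = num_hashes // bands
    buckets = {}        # (band index, band values) -> indices of kept docs
    kept_shingles = {}  # index of kept doc -> its shingle set
    for idx, doc in enumerate(docs):
        sh = shingles(doc)
        sig = minhash_signature(sh, num_hashes)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # Candidates are kept docs that share at least one LSH bucket.
        candidates = {j for key in keys for j in buckets.get(key, [])}
        # Verify candidates with exact Jaccard similarity before dropping.
        if any(jaccard(sh, kept_shingles[j]) >= threshold for j in candidates):
            continue
        kept_shingles[idx] = sh
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return [docs[i] for i in kept_shingles]


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "completely different text about training language models on clean data",
    "the quick brown fox jumps over the lazy dog near the river bank today",
]
unique = deduplicate_sketch(docs)
print(len(unique), "of", len(docs), "documents kept")
```

The number of bands and rows per band trades candidate recall against the number of exact comparisons: more bands with fewer rows surface more candidate pairs, and the Jaccard check then filters out false positives.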

### Stages 4-6: Tokenization, Packing, and Batching

The `SimpleTokenizer` trains a **byte-pair-encoding (BPE)** tokenizer on the cleaned corpus. The `pack_sequences` function pads or truncates token streams into fixed-length sequences with attention masks. Finally, `PreTrainingDataLoader` shuffles and yields mini-batches for training.
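The padding, truncation, and batching behavior described above can be sketched in plain Python. The names `pack_sequences_sketch` and `iter_batches`, the `pad_id` of 0, and the fixed shuffle seed are assumptions for illustration; the repository's `pack_sequences` and `PreTrainingDataLoader` may differ in signature and detail.

```python
import random


def pack_sequences_sketch(token_streams, seq_len, pad_id=0):
    """Pad or truncate each token list to seq_len and build attention masks."""
    input_ids, attention_masks = [], []
    for tokens in token_streams:
        clipped = tokens[:seq_len]              # truncate long streams
        pad = seq_len - len(clipped)            # number of padding slots
        input_ids.append(clipped + [pad_id] * pad)
        attention_masks.append([1] * len(clipped) + [0] * pad)
    return input_ids, attention_masks


def iter_batches(input_ids, attention_masks, batch_size, seed=0):
    """Shuffle sequence indices and yield (ids, masks) mini-batches."""
    order = list(range(len(input_ids)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [input_ids[i] for i in chunk], [attention_masks[i] for i in chunk]


ids, masks = pack_sequences_sketch([[5, 6, 7], [8, 9, 10, 11, 12, 13]], seq_len=4)
print(ids)    # [[5, 6, 7, 0], [8, 9, 10, 11]]
print(masks)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The attention mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions when computing attention and loss.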

```python
# File: phases/10-llms-from-scratch/03-data-pipelines/code/main.py
# Run from the directory containing main.py so the import below resolves.
from main import run_pipeline

if __name__ == "__main__":
    run_pipeline()
```

Running the script prints per-stage statistics in the following form; the exact counts depend on the input corpus:

```

Stage 1: Cleaning → 35 documents
Stage 2: Quality Filtering → removed 4 low‑quality docs
Stage 3: Deduplication → removed 3 near‑duplicates
Stage 4: Tokenizer trained with 256 tokens
...
Dataset Statistics → total_documents: 28, total_tokens: 1,234,567, vocab_utilization: 45%

```

## Key Implementation Files

The repository organizes preprocessing utilities across domain-specific modules:

- [`phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py) — NLP tokenization, stemming, lemmatization, and the `preprocess` helper.
- [`phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py) — Vision pipeline with `VisionPipeline.preprocess` method for image tensor conversion.
- [`phases/10-llms-from-scratch/03-data-pipelines/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/03-data-pipelines/code/main.py) — Full LLM data pipeline including `clean_text`, `quality_filter`, `deduplicate`, BPE tokenization, and sequence packing.
- [`phases/04-computer-vision/01-image-fundamentals/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/01-image-fundamentals/code/main.py) — ImageNet-style preprocessing via `preprocess_imagenet`.

## Summary

- The **rohitg00/ai-engineering-from-scratch** repository implements **data preprocessing techniques** across NLP, computer vision, and LLM domains.
- **NLP preprocessing** includes regex-based tokenization, rule-based stemming, and lemmatization with optional POS tagging in [`phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py).
- **Vision preprocessing** handles NumPy and PyTorch inputs with automatic device placement and pixel normalization in [`phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py).
- **LLM pipelines** provide production-grade cleaning, MinHash-LSH deduplication, BPE tokenization, and sequence packing in [`phases/10-llms-from-scratch/03-data-pipelines/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/03-data-pipelines/code/main.py).
- All modules are self-contained and executable without external services, enabling hands-on experimentation with real-world preprocessing workflows.

## Frequently Asked Questions

### What specific text cleaning methods are implemented in the LLM data pipeline?

The LLM pipeline in [`phases/10-llms-from-scratch/03-data-pipelines/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/03-data-pipelines/code/main.py) implements `clean_text` for HTML stripping and URL removal, `quality_filter` for heuristic document rejection based on length and character distribution, and `deduplicate` for near-duplicate detection using MinHash signatures and Jaccard similarity thresholds.

### How does the vision preprocessing handle different input formats?

The `VisionPipeline.preprocess` method in [`phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/04-computer-vision/16-vision-pipeline-capstone/code/main.py) detects whether the input is a NumPy `ndarray` or PyTorch `Tensor`, validates dimensions for RGB images (requiring `HxWx3` or `(3, H, W)` formats), normalizes pixel values to `[0, 1]`, and moves the data to the configured compute device.

### Does the repository include stemming or lemmatization for NLP tasks?

Yes, the [`phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/05-nlp-foundations-to-advanced/01-text-processing/code/main.py) file contains both `stem_step_1a` for rule-based stemming and `lemmatize` for dictionary-based or rule-based lemmatization, wrapped in a unified `preprocess` function that returns tokens, stems, and lemmas alongside optional POS tags.

### Are the preprocessing utilities production-ready for large datasets?

Partly. The LLM preprocessing pipeline includes techniques used in production systems, such as MinHash-LSH deduplication and memory-efficient batching via `PreTrainingDataLoader`. However, the implementations in `ai-engineering-from-scratch` are designed for educational clarity and single-machine execution rather than distributed cluster processing.