# Why NumPy, PyTorch, h5py, zstandard, and safetensors Are the Only Packages Allowed in the AI Engineering Curriculum

> Discover why the AI Engineering curriculum restricts packages to NumPy, PyTorch, h5py, zstandard, and safetensors. Learn about the educational benefits of a standard‑library‑first, lightweight, and secure approach.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: best-practices
- Published: 2026-06-08

---

**The AI Engineering curriculum restricts Python packages to NumPy, PyTorch, h5py, zstandard, and safetensors to enforce a standard‑library‑first, educational approach that eliminates black‑box dependencies while keeping builds lightweight and secure.**

The `rohitg00/ai-engineering-from-scratch` repository is built around the philosophy of **“standard‑library‑first”** and **educational clarity**. Each lesson requires students to implement core machine‑learning concepts from scratch before any external library is introduced. This minimal allowlist is codified in the **Dependency Allowlist** inside [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md), and it directly shapes how learners approach tensor math, automatic differentiation, and data handling.

## Core Rationale for the Curriculum’s Minimal Dependency Policy

### Focus on Fundamentals

By restricting imports to a handful of well‑vetted libraries, the curriculum forces learners to write the raw math for tensors, gradients, and data pipelines themselves. This prevents the **black‑box effect** that would emerge if large, high‑level frameworks were used throughout every phase. In [`phases/01-foundations/01-linear-algebra/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/01-foundations/01-linear-algebra/code/main.py), students derive solutions manually before ever calling `np.linalg.inv`.
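As an illustration of what deriving a solution manually can look like, here is a minimal sketch that inverts a 2×2 matrix with the determinant formula in pure Python. The function names `inverse_2x2` and `matmul` are hypothetical and are not taken from the lesson file.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists, using the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(x, y):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


A = [[4.0, 7.0], [2.0, 6.0]]
A_inv = inverse_2x2(A)

# Multiplying A by its inverse should give the identity, up to rounding error
print(matmul(A, A_inv))
```

Writing the formula out this way makes the role of the determinant visible, which a single library call would hide.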

### Predictable, Reproducible Environments

A short allowlist keeps the build environment lightweight and predictable across all lessons. CI pipelines only need to install the handful of packages listed in [`requirements.txt`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/requirements.txt), which sidesteps many of the version conflicts that plague larger dependency graphs. As a result, every learner starts from the same set of dependencies.

### Safety and Security

Fewer third‑party packages mean a reduced attack surface. The selected libraries are mature, widely audited, and have stable APIs, which aligns with the curriculum’s emphasis on **ethics, safety, and alignment** covered in the *Ethics & Safety* phase. The policy rejects experimental or niche libraries that could introduce supply‑chain risks.

### Performance‑Critical Training

While early lessons stay in pure Python or NumPy, realistic training workloads require GPU acceleration. **PyTorch** provides GPU‑backed tensor operations and automatic differentiation at speeds that hand‑written Python loops cannot approach. Meanwhile, **NumPy** supplies fast, vectorized CPU computation for prototyping.

### Data‑IO and Serialization Support

**h5py** enables reading and writing large binary datasets common in AI research. **safetensors** stores model weights in a fast, safe format, and **zstandard** compresses tensors for efficient I/O. Together these three ensure that lessons shipping pretrained artifacts can do so reliably without pulling in heavy framework ecosystems.

## How the Allowlist Is Enforced in the Repository

### AGENTS.md and requirements.txt

The source of truth is [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md), which explicitly lists the permitted packages and the rationale behind them. The companion [`requirements.txt`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/requirements.txt) pins exact versions so that every environment installs the same releases. Instructors must justify any deviation with the reason **“stays std‑lib‑first for educational clarity.”**

### CI Validation with audit_lessons.py

The script [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) scans lesson code during CI runs. Any attempt to import a package outside the allowlist triggers a build failure. This automated gate guarantees that the AI Engineering curriculum package restrictions remain intact across community contributions.

## Code Examples: Using the Allowed Libraries in Lesson Files

### NumPy for Raw Linear Algebra

Early‑phase lessons use NumPy to implement closed‑form solutions before moving to automatic differentiation.

```python
import numpy as np

# Generate synthetic data: y = 3.5 * x + noise
X = np.random.randn(100, 1)
y = 3.5 * X.squeeze() + np.random.randn(100) * 0.5

# Closed-form least-squares solution (normal equation, no intercept term)
w = np.linalg.inv(X.T @ X) @ X.T @ y
print("Learned weight:", w.item())
```

### PyTorch for Neural Network Training

Mid‑phase lessons introduce `torch` for GPU‑accelerated backpropagation once students understand the math.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy dataset
inputs = torch.randn(200, 10, device=device)
targets = torch.randn(200, 1, device=device)

# Simple feed-forward network
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

print("Final loss:", loss.item())
```

### h5py for Large Binary Datasets

Data‑handling lessons store feature matrices on disk with compression.

```python
import h5py
import numpy as np

data = np.random.randn(1000, 256).astype("float32")
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("features", data=data, compression="gzip")

```

### safetensors for Secure Model Checkpoints

Advanced lessons serialize weights without pickle to avoid arbitrary code execution.

```python
import torch
from safetensors.torch import save_file, load_file

model = torch.nn.Linear(5, 2)

# save_file expects a flat dict mapping names to tensors,
# which is exactly what state_dict() returns
save_file(model.state_dict(), "checkpoint.safetensors")

# Later: load the tensors back and restore them into the model
loaded = load_file("checkpoint.safetensors")
model.load_state_dict(loaded)
```
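To see why avoiding pickle matters, consider this small, harmless sketch using only the standard library. The `Payload` class is a hypothetical example; it shows that `pickle.loads` calls whatever callable the serialized object names, which is the behavior a tensor-only format is meant to rule out.

```python
import pickle


class Payload:
    # __reduce__ tells pickle which callable to invoke when the object is loaded
    def __reduce__(self):
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs eval("1 + 1") and returns the result
result = pickle.loads(blob)
print(type(result).__name__, result)  # int 2
```

Here the evaluated expression is trivial, but a checkpoint from an untrusted source could name any callable, so loading it amounts to running someone else's code.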

### zstandard for Fast Tensor Compression

Lessons shipping large artifacts use `zstandard` to reduce I/O overhead.

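```python
import numpy as np
import torch
import zstandard as zstd

tensor = torch.randn(100, 100)

# Compress the tensor's raw bytes in one shot
cctx = zstd.ZstdCompressor()
compressed = cctx.compress(tensor.numpy().tobytes())

# Decompress and rebuild the tensor; shape and dtype must be tracked separately
raw = zstd.ZstdDecompressor().decompress(compressed)
restored = torch.from_numpy(np.frombuffer(raw, dtype=np.float32).reshape(100, 100).copy())
assert torch.equal(tensor, restored)
```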

## Summary

- The AI Engineering curriculum allows only **NumPy**, **PyTorch**, **h5py**, **zstandard**, and **safetensors** to enforce a standard‑library‑first pedagogy.
- The policy is codified in [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md) and enforced automatically by [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py).
- Limiting dependencies prevents black‑box learning, secures the supply chain, and guarantees reproducible builds.
- Each permitted library serves a specific, irreplaceable role: NumPy for CPU math, PyTorch for GPU training, h5py for dataset I/O, safetensors for weight serialization, and zstandard for compression.

## Frequently Asked Questions

### Why are TensorFlow and JAX not allowed in the AI Engineering curriculum?

The curriculum deliberately excludes high‑level frameworks like TensorFlow and JAX to force students to implement backpropagation, optimizers, and layer logic by hand. Introducing these libraries too early would create the black‑box effect that the standard‑library‑first policy is designed to eliminate.
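As a rough idea of what implementing this by hand involves, the sketch below fits a single weight with a hand-derived gradient and a plain gradient-descent update, using only the standard library. The data, learning rate, and step count are illustrative assumptions and are not taken from any lesson.

```python
import random

random.seed(0)

# Synthetic data: y = 3.5 * x + small Gaussian noise
xs = [random.uniform(-1, 1) for _ in range(100)]
ys = [3.5 * x + random.gauss(0, 0.1) for x in xs]

w = 0.0   # single weight to learn
lr = 0.5  # learning rate

for _ in range(200):
    # Gradient of mean((w*x - y)^2) with respect to w, derived by hand:
    # mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Gradient-descent update, the step an optimizer would otherwise perform
    w -= lr * grad

print("Learned weight:", w)
```

The learned weight should land close to 3.5. Every quantity an autograd engine and optimizer would compute is written out explicitly, which is the understanding the policy aims to build before a framework takes over.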

### Can I add a missing utility library to a lesson if it is not in the allowlist?

No. Any import outside the list is rejected by the CI pipeline running [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py). If a lesson truly requires external functionality, the maintainer must update the **Dependency Allowlist** in [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md) and provide the justification “stays std‑lib‑first for educational clarity.”

### How does restricting packages improve AI safety and alignment?

A smaller dependency graph reduces the attack surface for supply‑chain exploits and makes it practical to vet each library for a stable, well‑audited API. This mirrors the curriculum’s *Ethics & Safety* phase, which teaches that reliable systems start with controlled, transparent dependencies.

### What is the exact difference between h5py and safetensors in the curriculum?

**h5py** is used for storing and loading large training datasets in HDF5 format, while **safetensors** is reserved for serializing and deserializing model weights. Separating data I/O from checkpoint I/O keeps the pipeline modular and avoids locking the curriculum into a single monolithic framework.