Why NumPy, PyTorch, h5py, zstandard, and safetensors Are the Only Packages Allowed in the AI Engineering Curriculum

The AI Engineering curriculum restricts Python packages to NumPy, PyTorch, h5py, zstandard, and safetensors to enforce a standard‑library‑first, educational approach that eliminates black‑box dependencies while keeping builds lightweight and secure.

The rohitg00/ai-engineering-from-scratch repository is built around the philosophy of “standard‑library‑first” and educational clarity. Each lesson requires students to implement core machine‑learning concepts from scratch before any external library is introduced. This minimal allowlist is codified in the Dependency Allowlist inside AGENTS.md, and it directly shapes how learners approach tensor math, automatic differentiation, and data handling.

Core Rationale for the Curriculum’s Minimal Dependency Policy

Focus on Fundamentals

By restricting imports to a handful of well‑vetted libraries, the curriculum forces learners to write the raw math for tensors, gradients, and data pipelines themselves. This prevents the black‑box effect that would emerge if large, high‑level frameworks were used throughout every phase. In phases/01-foundations/01-linear-algebra/code/main.py, students derive solutions manually before ever calling np.linalg.inv.
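As a rough illustration of "derive first, call the library second", the sketch below inverts a 2x2 matrix with the adjugate formula and then checks the result against np.linalg.inv. The function name inverse_2x2 and the sample matrix are assumptions made for this example; they are not taken from the lesson file.

```python
import numpy as np

def inverse_2x2(m):
    """Invert a 2x2 matrix (nested lists) with the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical example matrix; determinant is 4*6 - 7*2 = 10
A = [[4.0, 7.0], [2.0, 6.0]]
manual = inverse_2x2(A)
library = np.linalg.inv(np.array(A))

# The hand-derived result should agree with the library call
print(np.allclose(manual, library))
```

Writing the formula out once makes the later library call a convenience rather than a mystery.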

Predictable, Reproducible Environments

A short allowlist keeps the build environment lightweight and deterministic across all lessons. CI pipelines can install the handful of packages listed in requirements.txt in seconds, avoiding the version conflicts that plague larger dependency graphs. This predictability ensures that every learner starts from an identical state.

Safety and Security

Fewer third‑party packages mean a reduced attack surface. The selected libraries are mature, widely audited, and have stable APIs, which aligns with the curriculum’s emphasis on ethics, safety, and alignment covered in the Ethics & Safety phase. The policy rejects experimental or niche libraries that could introduce supply‑chain risks.

Performance‑Critical Training

While early lessons stay in pure Python or NumPy, realistic training workloads require GPU acceleration. PyTorch provides GPU‑backed tensor operations that interpreted Python loops cannot match in speed, and NumPy supplies fast, vectorized CPU computation for prototyping.
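The gap between manual loops and vectorized operations can be seen even on a CPU. The following sketch times a pure-Python dot product against the NumPy equivalent; it does not measure GPU performance, and the array size and the helper name dot_loop are arbitrary choices for illustration.

```python
import time
import numpy as np

def dot_loop(xs, ys):
    """Dot product written as an explicit Python loop."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

rng = np.random.default_rng(0)
a = rng.standard_normal(200_000)
b = rng.standard_normal(200_000)
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
slow = dot_loop(a_list, b_list)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = float(a @ b)  # vectorized dot product
numpy_seconds = time.perf_counter() - start

# Timings vary by machine; both paths should compute the same value
print(f"loop: {loop_seconds:.4f}s  numpy: {numpy_seconds:.4f}s")
```

Exact timings depend on hardware, so the useful observation is the ratio between the two, not the absolute numbers.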

Data‑IO and Serialization Support

h5py enables reading and writing large binary datasets common in AI research. safetensors stores model weights in a fast, safe format, and zstandard compresses tensors for efficient I/O. Together these three ensure that lessons shipping pretrained artifacts can do so reliably without pulling in heavy framework ecosystems.

How the Allowlist Is Enforced in the Repository

AGENTS.md and requirements.txt

The source of truth is AGENTS.md, which explicitly lists the permitted packages and the rationale behind them. The companion requirements.txt pins exact versions so that every environment installs the same package releases. Instructors must justify any deviation with the reason “stays std‑lib‑first for educational clarity.”
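To make "pins exact versions" concrete, here is a minimal sketch that flags requirement lines not pinned with ==. The helper unpinned_lines and the sample text are hypothetical, and the version numbers are placeholders rather than the repository's actual pins.

```python
import re

# A pinned line looks like "name==version" with no ranges or wildcards
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[A-Za-z0-9.!+_-]+$")

def unpinned_lines(requirements_text):
    """Return requirement lines that are not pinned with '=='."""
    problems = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems

# Placeholder versions for illustration only
sample = """
numpy==0.0.0
torch>=0.0
h5py
"""
print(unpinned_lines(sample))
```

A range such as >= or a bare package name lets the installer pick whatever release is newest, which is what pinning is meant to prevent.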

CI Validation with audit_lessons.py

The script scripts/audit_lessons.py scans lesson code during CI runs. Any attempt to import a package outside the allowlist triggers a build failure. This automated gate guarantees that the AI Engineering curriculum package restrictions remain intact across community contributions.

Code Examples: Using the Allowed Libraries in Lesson Files

NumPy for Raw Linear Algebra

Early‑phase lessons use NumPy to implement closed‑form solutions before moving to automatic differentiation.

import numpy as np

# Generate synthetic data: y is roughly 3.5 * x plus Gaussian noise
X = np.random.randn(100, 1)
y = 3.5 * X.squeeze() + np.random.randn(100) * 0.5

# Closed-form least-squares solution (normal equations): w = (X^T X)^-1 X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print("Learned weight:", w.item())

PyTorch for Neural Network Training

Mid‑phase lessons introduce torch for GPU‑accelerated backpropagation once students understand the math.

import torch
import torch.nn as nn
import torch.optim as optim

# Dummy dataset
inputs = torch.randn(200, 10)
targets = torch.randn(200, 1)

# Simple feed-forward network
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
print("Final loss:", loss.item())
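For comparison, the math that loss.backward() and optimizer.step() handle in the loop above can be written by hand for a simple case. This sketch trains a linear model with mean squared error using a manually derived gradient and plain gradient descent in NumPy. The data, learning rate, and step count are arbitrary assumptions, and it uses plain gradient descent, not Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.5, -2.0, 0.5])          # hypothetical target weights
y = X @ true_w + 0.1 * rng.standard_normal(200)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    error = X @ w - y                         # forward pass: residuals
    loss = np.mean(error ** 2)                # mean squared error
    grad = 2.0 * X.T @ error / len(y)         # hand-derived d(loss)/dw
    w -= lr * grad                            # gradient descent update

print("Learned weights:", np.round(w, 2))
print("Final loss:", loss)
```

Each line of the loop corresponds to one step that the PyTorch version performs implicitly: the forward pass, the loss, the gradient, and the parameter update.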

h5py for Large Binary Datasets

Data‑handling lessons store feature matrices on disk with compression.

import h5py
import numpy as np

data = np.random.randn(1000, 256).astype("float32")
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("features", data=data, compression="gzip")

safetensors for Secure Model Checkpoints

Advanced lessons serialize weights without pickle to avoid arbitrary code execution.

import torch
from safetensors.torch import save_file, load_file

model = torch.nn.Linear(5, 2)

# save_file expects a flat mapping of names to tensors,
# so pass the state dict directly rather than nesting it
save_file(model.state_dict(), "checkpoint.safetensors")

# Later, load the tensors back into the model
loaded = load_file("checkpoint.safetensors")
model.load_state_dict(loaded)

zstandard for Fast Tensor Compression

Lessons shipping large artifacts use zstandard to reduce I/O overhead.

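import numpy as np
import torch
import zstandard as zstd

tensor = torch.randn(100, 100)

# Compress the tensor's raw bytes in one shot
raw = tensor.numpy().tobytes()
compressed = zstd.ZstdCompressor().compress(raw)

# Decompress later and rebuild the tensor
# (shape and dtype are not stored in the bytes and must be tracked separately)
restored_bytes = zstd.ZstdDecompressor().decompress(compressed)
restored = torch.from_numpy(
    np.frombuffer(restored_bytes, dtype=np.float32).reshape(100, 100).copy()
)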

Summary

  • The AI Engineering curriculum allows only NumPy, PyTorch, h5py, zstandard, and safetensors to enforce a standard‑library‑first pedagogy.
  • The policy is codified in AGENTS.md and enforced automatically by scripts/audit_lessons.py.
  • Limiting dependencies prevents black‑box learning, secures the supply chain, and guarantees reproducible builds.
  • Each permitted library serves a specific, irreplaceable role: NumPy for CPU math, PyTorch for GPU training, h5py for dataset I/O, safetensors for weight serialization, and zstandard for compression.

Frequently Asked Questions

Why are TensorFlow and JAX not allowed in the AI Engineering curriculum?

The curriculum deliberately excludes high‑level frameworks like TensorFlow and JAX to force students to implement backpropagation, optimizers, and layer logic by hand. Introducing these libraries too early would create the black‑box effect that the standard‑library‑first policy is designed to eliminate.

Can I add a missing utility library to a lesson if it is not in the allowlist?

No. Any import outside the list is rejected by the CI pipeline running scripts/audit_lessons.py. If a lesson truly requires external functionality, the maintainer must update the Dependency Allowlist in AGENTS.md and provide the justification “stays std‑lib‑first for educational clarity.”

How does restricting packages improve AI safety and alignment?

A smaller dependency graph reduces the attack surface for supply‑chain exploits and ensures that every library has a stable, audited API. This mirrors the curriculum’s Ethics & Safety phase, which teaches that reliable systems start with controlled, transparent dependencies.

What is the exact difference between h5py and safetensors in the curriculum?

h5py is used for storing and loading large training datasets in HDF5 format, while safetensors is reserved for serializing and deserializing model weights. Separating data I/O from checkpoint I/O keeps the pipeline modular and avoids locking the curriculum into a single monolithic framework.
