How catalog.json Is Generated and Used in AI Engineering from Scratch

June 7, 2026 rohitg00/ai-engineering-from-scratch

catalog.json is the canonical curriculum manifest generated by scripts/build_catalog.py and consumed by both the README validator and the static site builder to keep every lesson, skill, and code artifact in sync across the repository.

The ai-engineering-from-scratch curriculum by Rohit Ghumare keeps its structure in one machine-readable manifest, catalog.json, according to the rohitg00/ai-engineering-from-scratch source code. The file records every phase, lesson, skill, and code artifact, so downstream scripts can read it instead of maintaining duplicate data sources. Anyone contributing to or extending the curriculum needs to know how the file is generated and where it is consumed.

How catalog.json Is Generated

The manifest is created by scripts/build_catalog.py, a Python script that uses only the standard library, so it runs without installing any external dependencies.

The build_catalog.py Discovery Pipeline

When executed, build_catalog.py performs a deterministic walk through the phases/ directory and assembles a hierarchical dictionary that is ultimately serialized to JSON. The pipeline follows these steps:

1. Discover phases: The script scans phases/NN-slug/ directories using a regex to identify every phase in numeric order.
2. Iterate lessons: Inside each phase, it locates lesson folders matching the NN-slug/ pattern.
3. Extract lesson metadata: For every lesson, it reads the H1 title from docs/en.md (falling back to a slug-derived title), lists code files (.py, .ts, etc.) from code/, parses output artifacts (skill-*.md, prompt-*.md, agent-*.md) from outputs/, and records the presence of docs, quizzes, and notebooks.
4. Compute curriculum totals: It aggregates the number of phases, lessons, skills, prompts, agents, and code files across the entire repository.
5. Emit the manifest: The resulting dictionary is written to catalog.json in the repository root by default, or printed to the console when the --stdout flag is passed.
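The steps above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration and not the actual contents of scripts/build_catalog.py. The regex, the helper names (numbered_dirs, file_names, lesson_title, build_catalog), and the dictionary keys are all assumptions made for the example. It also simplifies the real behavior by listing every file in code/ and by skipping the docs, quiz, and notebook presence flags.

```python
import re
from pathlib import Path

# Assumed pattern: numeric prefix, hyphen, slug (for example "03-tokenizers").
NN_SLUG = re.compile(r"^(\d+)-(.+)$")
# Assumed mapping from output filename prefix to the totals key it feeds.
OUTPUT_KINDS = {"skill-": "skills", "prompt-": "prompts", "agent-": "agents"}


def numbered_dirs(parent: Path) -> list[Path]:
    """Return NN-slug subdirectories of parent, ordered by numeric prefix."""
    found = []
    if parent.is_dir():
        for child in parent.iterdir():
            match = NN_SLUG.match(child.name)
            if match and child.is_dir():
                found.append((int(match.group(1)), child.name, child))
    return [path for _, _, path in sorted(found)]


def file_names(folder: Path) -> list[str]:
    """List plain files in folder, or nothing if the folder is missing."""
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


def lesson_title(lesson: Path) -> str:
    """Use the first H1 in docs/en.md, falling back to a slug-derived title."""
    doc = lesson / "docs" / "en.md"
    if doc.is_file():
        for line in doc.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                return line[2:].strip()
    slug = NN_SLUG.match(lesson.name).group(2)
    return slug.replace("-", " ").title()


def build_catalog(root: Path) -> dict:
    """Walk root/phases and assemble phases, lessons, and totals."""
    totals = {"phases": 0, "lessons": 0, "skills": 0,
              "prompts": 0, "agents": 0, "code_files": 0}
    phases = []
    for phase in numbered_dirs(root / "phases"):
        lessons = []
        for lesson in numbered_dirs(phase):
            code = file_names(lesson / "code")
            outputs = [n for n in file_names(lesson / "outputs")
                       if n.endswith(".md")]
            for name in outputs:
                for prefix, key in OUTPUT_KINDS.items():
                    if name.startswith(prefix):
                        totals[key] += 1
            totals["code_files"] += len(code)
            lessons.append({"slug": lesson.name,
                            "title": lesson_title(lesson),
                            "code": code, "outputs": outputs})
        totals["phases"] += 1
        totals["lessons"] += len(lessons)
        phases.append({"slug": phase.name, "lessons": lessons})
    return {"phases": phases, "totals": totals}
```

Passing the result to json.dumps with a fixed indent would produce the serialized manifest. Sorting by the integer prefix rather than the raw directory name keeps a phase such as 10-later after 02-basics even if the prefixes are not zero-padded consistently.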

Regenerating the Catalog Locally

You can regenerate the manifest at any time using the following commands:

# Create or overwrite catalog.json at the repo root
python3 scripts/build_catalog.py

# Preview the JSON on stdout without writing to disk
python3 scripts/build_catalog.py --stdout

Where catalog.json Is Used

Once generated, catalog.json acts as the single source of truth for downstream automation. The repository consumes it in two primary places: README count verification and static site generation.

README Count Verification

The script scripts/check_readme_counts.py reads catalog.json and compares the computed totals against the hard-coded statistics embedded in README.md. If the numbers differ, the script automatically rewrites the README to match the manifest; otherwise it exits with status 0. This ensures that the public-facing documentation never drifts out of sync with the actual curriculum size.

To verify the counts manually, run:

# Exits 0 if README matches the catalog; otherwise updates README.md
python3 scripts/check_readme_counts.py
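The compare-then-rewrite behavior described in this section might look roughly like the sketch below. It is not the code in scripts/check_readme_counts.py. The function name sync_readme_counts is hypothetical. So is the assumption that the README states each total as a number followed by its label (for example "260 lessons"), and that the totals dictionary is keyed by those labels.

```python
import re


def sync_readme_counts(readme_text: str, totals: dict[str, int]) -> tuple[str, bool]:
    """Rewrite '<number> <label>' phrases to match totals.

    Returns the (possibly updated) text and whether anything changed.
    The '<number> <label>' README format is an assumption for this sketch.
    """
    updated = readme_text
    for label, count in totals.items():
        # Match only the digits that are directly followed by the label word.
        pattern = re.compile(rf"\b\d+(?=\s+{re.escape(label)}\b)")
        updated = pattern.sub(str(count), updated)
    return updated, updated != readme_text


text = "Covers 260 lessons across 20 phases."
new_text, changed = sync_readme_counts(text, {"lessons": 265, "phases": 20})
print(new_text)  # Covers 265 lessons across 20 phases.
print(changed)   # True
```

A caller would write new_text back to README.md only when changed is true, and leave the file untouched otherwise.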

Static Site Generation

In the website pipeline, site/build.js loads catalog.json and transforms its contents into site/data.js. That data module powers the interactive curriculum website by supplying phase titles, lesson URLs, and artifact metadata to the front-end renderer. Because the repository regenerates catalog.json on every push to main, the website reflects the latest curriculum state without any manual edits.

To rebuild the site locally, use:

# Internally calls site/build.js, which consumes catalog.json
npm run build
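The real transformation lives in Node.js inside site/build.js, and its exact output format is not shown here. Purely to illustrate the idea of turning the manifest into a data module, the Python sketch below serializes a catalog dictionary into the text of a JavaScript file. The function name render_data_js and the global name CURRICULUM are hypothetical.

```python
import json


def render_data_js(catalog: dict, global_name: str = "CURRICULUM") -> str:
    """Return JavaScript source that assigns the catalog to a browser global.

    The global name and the single-assignment layout are assumptions;
    site/build.js may shape site/data.js differently.
    """
    payload = json.dumps(catalog, indent=2, ensure_ascii=False)
    return f"window.{global_name} = {payload};\n"


sample = {"totals": {"phases": 1, "lessons": 2}}
print(render_data_js(sample))
```

Because JSON is valid as a JavaScript object literal, the front end can read the generated module directly without a separate parsing step.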

Key Files in the catalog.json Pipeline

The following files define the end-to-end lifecycle of the curriculum manifest:

scripts/build_catalog.py — Pure-stdlib generator that walks phases/ and writes catalog.json.
scripts/check_readme_counts.py — Validates or updates README.md totals against the manifest.
site/build.js — Node.js build script that converts the manifest into the website’s data layer.
catalog.json — The generated artifact at the repository root that serves as the canonical manifest.

Summary

catalog.json is the single source of truth for the ai-engineering-from-scratch curriculum structure.
scripts/build_catalog.py generates the file by scanning phases/ and aggregating lesson metadata, code files, and output artifacts.
scripts/check_readme_counts.py consumes the manifest to keep README.md statistics accurate.
site/build.js transforms the manifest into site/data.js to power the interactive curriculum website.
The pipeline runs automatically on every push to main, ensuring the repository and website stay synchronized.

Frequently Asked Questions

What is the purpose of catalog.json in ai-engineering-from-scratch?

catalog.json serves as the canonical machine-readable manifest that describes the entire curriculum. It records every phase, lesson, skill, prompt, agent, and code file, enabling downstream scripts to verify documentation counts and render the website without maintaining duplicate data sources.

How do I regenerate catalog.json after adding a new lesson?

Run python3 scripts/build_catalog.py from the repository root. The script will walk the updated phases/ directory, extract metadata from the new lesson’s docs/en.md, code/, and outputs/ folders, and overwrite catalog.json with the new totals. You can preview the result on the terminal by adding the --stdout flag.

Why does check_readme_counts.py modify README.md automatically?

The script treats catalog.json as the ground truth. When it detects a mismatch between the manifest totals and the hard-coded numbers in README.md, it rewrites the README to match the actual curriculum state. This automation prevents stale statistics from being published to the repository.

Does the curriculum website update automatically when catalog.json changes?

Yes. The website build pipeline regenerates catalog.json on every push to main, and site/build.js immediately consumes that fresh manifest to produce site/data.js. Because this happens in CI, the interactive curriculum site always reflects the latest lesson structure without manual intervention.
