# How catalog.json Is Generated and Used in AI Engineering from Scratch

> How `catalog.json` is generated in AI Engineering from Scratch, and how the README validator and the static site builder use it to stay in sync with the curriculum.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: internals
- Published: 2026-06-07

---

**[`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) is the canonical curriculum manifest generated by [`scripts/build_catalog.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/build_catalog.py) and consumed by both the README validator and the static site builder to keep every lesson, skill, and code artifact in sync across the repository.**

The `ai-engineering-from-scratch` curriculum by Rohit Ghumare relies on one machine-readable manifest, [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json). Based on the rohitg00/ai-engineering-from-scratch source code, this JSON file is the single source of truth for all phases, lessons, skills, and code artifacts. Anyone contributing to or extending the curriculum needs to know how the file is generated and where it is consumed.

## How catalog.json Is Generated

The entire manifest is created by a Python script located at [`scripts/build_catalog.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/build_catalog.py). The script imports only the Python standard library, so it runs on a plain Python 3 installation with no packages to install first.

### The build_catalog.py Discovery Pipeline

When executed, [`build_catalog.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/build_catalog.py) performs a deterministic walk through the `phases/` directory and assembles a hierarchical dictionary that is ultimately serialized to JSON. The pipeline follows these steps:

1. **Discover phases** — The script scans `phases/NN-slug/` directories using a regex to identify every phase in numeric order.
2. **Iterate lessons** — Inside each phase, it locates lesson folders matching the `NN-slug/` pattern.
3. **Extract lesson metadata** — For every lesson, it reads the H1 title from the lesson's own `docs/en.md` (falling back to a slug-derived title), lists code files (`.py`, `.ts`, etc.) from `code/`, parses output artifacts (`skill-*.md`, `prompt-*.md`, `agent-*.md`) from `outputs/`, and records the presence of docs, quizzes, and notebooks.
4. **Compute curriculum totals** — It aggregates the number of phases, lessons, skills, prompts, agents, and code files across the entire repository.
5. **Emit the manifest** — The resulting dictionary is written to [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) in the repository root by default, or printed to the console when the `--stdout` flag is passed.
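The steps above can be sketched in a few dozen lines of standard-library Python. This is an illustrative approximation, not the code in `scripts/build_catalog.py`: the regex, the `CODE_EXTS` set, the function names, and the output keys (`totals`, `phases`, `slug`, `outputs`, and so on) are all assumptions, and the quiz and notebook checks are omitted because the article does not say where those files live.

```python
import re
from pathlib import Path

# Assumed naming pattern: two digits, a hyphen, then a slug ("01-foundations").
NN_SLUG = re.compile(r"^(\d{2})-([a-z0-9-]+)$")
# Assumed extension set; the real script may recognize more file types.
CODE_EXTS = {".py", ".ts", ".js", ".rs"}
ARTIFACT_KINDS = ("skill", "prompt", "agent")


def numbered_dirs(parent: Path) -> list:
    """Return child directories named NN-slug, sorted so 01 comes before 02."""
    if not parent.is_dir():
        return []
    found = [p for p in parent.iterdir() if p.is_dir() and NN_SLUG.match(p.name)]
    return sorted(found, key=lambda p: p.name)


def lesson_title(lesson: Path) -> str:
    """Use the first H1 in docs/en.md, else derive a title from the slug."""
    doc = lesson / "docs" / "en.md"
    if doc.is_file():
        for line in doc.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                return line[2:].strip()
    slug = NN_SLUG.match(lesson.name).group(2)
    return slug.replace("-", " ").title()


def describe_lesson(lesson: Path) -> dict:
    """Collect title, code files, and output artifacts for one lesson."""
    code_dir, out_dir = lesson / "code", lesson / "outputs"
    code = []
    if code_dir.is_dir():
        code = sorted(p.name for p in code_dir.iterdir() if p.suffix in CODE_EXTS)
    outputs = {kind: [] for kind in ARTIFACT_KINDS}
    if out_dir.is_dir():
        for kind in ARTIFACT_KINDS:
            outputs[kind] = sorted(p.name for p in out_dir.glob(f"{kind}-*.md"))
    return {
        "slug": lesson.name,
        "title": lesson_title(lesson),
        "code": code,
        "outputs": outputs,
        "has_docs": (lesson / "docs" / "en.md").is_file(),
    }


def build_catalog(root: Path) -> dict:
    """Walk root/phases and return the manifest as a plain dictionary."""
    totals = {"phases": 0, "lessons": 0, "code_files": 0,
              "skills": 0, "prompts": 0, "agents": 0}
    phases = []
    for phase in numbered_dirs(root / "phases"):
        lessons = [describe_lesson(d) for d in numbered_dirs(phase)]
        phases.append({"slug": phase.name, "lessons": lessons})
        totals["phases"] += 1
        totals["lessons"] += len(lessons)
        for lesson in lessons:
            totals["code_files"] += len(lesson["code"])
            for kind in ARTIFACT_KINDS:
                totals[kind + "s"] += len(lesson["outputs"][kind])
    return {"totals": totals, "phases": phases}
```

Sorting by directory name gives numeric order only because the prefixes are zero-padded to the same width, which is one reason a fixed `NN-` convention keeps the walk deterministic.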

### Regenerating the Catalog Locally

You can regenerate the manifest at any time using the following commands:

```bash
# Create or overwrite catalog.json at the repo root
python3 scripts/build_catalog.py
```

```bash
# Preview the JSON on stdout without writing to disk
python3 scripts/build_catalog.py --stdout
```
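For readers wondering how a write-or-print switch like `--stdout` is usually wired up, here is a minimal sketch using `argparse`. Only the `--stdout` flag and the default `catalog.json` destination come from the description above; the `emit` and `main` helpers, the `out_path` parameter, and the placeholder catalog are hypothetical.

```python
import argparse
import json
import sys
from pathlib import Path
from typing import Optional


def emit(catalog: dict, to_stdout: bool, out_path: Path) -> None:
    """Serialize the catalog, then either print it or write it to out_path."""
    text = json.dumps(catalog, indent=2, sort_keys=True) + "\n"
    if to_stdout:
        sys.stdout.write(text)
    else:
        out_path.write_text(text, encoding="utf-8")


def main(argv: Optional[list] = None, out_path: Path = Path("catalog.json")) -> int:
    parser = argparse.ArgumentParser(description="Build the curriculum manifest.")
    parser.add_argument("--stdout", action="store_true",
                        help="print the JSON instead of writing catalog.json")
    args = parser.parse_args(argv)
    # Placeholder data; a real generator would walk phases/ here.
    catalog = {"totals": {"phases": 0, "lessons": 0}, "phases": []}
    emit(catalog, args.stdout, out_path)
    return 0
```

Calling `main(["--stdout"])` prints the JSON, while `main([])` writes it to `out_path`. Keeping serialization in one function means both modes produce byte-identical output, which makes the preview trustworthy.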

## Where catalog.json Is Used

Once generated, [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) acts as the single source of truth for downstream automation. The repository consumes it in two primary places: README count verification and static site generation.

### README Count Verification

The script [`scripts/check_readme_counts.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/check_readme_counts.py) reads [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) and compares the computed totals against the hard-coded statistics embedded in [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md). If the numbers differ, the script automatically rewrites the README to match the manifest; otherwise it exits with status `0`. This ensures that the public-facing documentation never drifts out of sync with the actual curriculum size.

To verify the counts manually, run:

```bash
# Exits 0 if README matches the catalog; otherwise updates README.md
python3 scripts/check_readme_counts.py
```
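The compare-then-rewrite behavior can be illustrated with a small sketch. It assumes the README states its counts as a number followed by a noun (for example "260 lessons"); the real README wording, the `LABELS` mapping, and the `sync_counts` helper are hypothetical and may not match how `check_readme_counts.py` locates its numbers.

```python
import re

# Hypothetical mapping from a catalog total to the noun used in README prose.
LABELS = {"phases": "phases", "lessons": "lessons", "skills": "skills"}


def sync_counts(readme: str, totals: dict) -> tuple:
    """Return (updated README text, list of totals whose README value was stale)."""
    changed = []
    for key, noun in LABELS.items():
        pattern = re.compile(rf"\b(\d+)(\s+){re.escape(noun)}\b")

        def fix(match, key=key, noun=noun):
            # Record the label once if the README number disagrees with the catalog.
            if int(match.group(1)) != totals[key] and key not in changed:
                changed.append(key)
            return f"{totals[key]}{match.group(2)}{noun}"

        readme = pattern.sub(fix, readme)
    return readme, changed
```

With `totals = {"phases": 20, "lessons": 260, "skills": 5}` and a README line reading `Covers 20 phases and 250 lessons.`, the function returns the line with `260 lessons` and reports `["lessons"]` as the stale entry. An empty list corresponds to the case where the README already matches.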

### Static Site Generation

In the website pipeline, [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) loads [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) and transforms its contents into [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js). That data module powers the interactive curriculum website by supplying phase titles, lesson URLs, and artifact metadata to the front-end renderer. Because the repository regenerates [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) on every push to `main`, the website reflects the latest curriculum state without any manual edits.

To rebuild the site locally, use:

```bash
# Internally calls site/build.js, which consumes catalog.json
npm run build
```
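The manifest-to-data-module step is implemented in JavaScript in `site/build.js`, but the idea is easy to show in Python for consistency with the other examples here. The sketch below wraps a catalog dictionary in a JavaScript assignment; the `render_data_js` helper, the `window.CURRICULUM` global, and the header comment are assumptions, not the actual shape of `site/data.js`.

```python
import json


def render_data_js(catalog: dict, global_name: str = "CURRICULUM") -> str:
    """Return JavaScript source that exposes the catalog as a browser global."""
    # JSON text is also a valid JavaScript expression, so it can be embedded directly.
    payload = json.dumps(catalog, indent=2)
    header = "// Generated from catalog.json. Do not edit by hand.\n"
    return f"{header}window.{global_name} = {payload};\n"
```

Generating the data module from the manifest, rather than editing it by hand, is what lets the front end pick up new phases and lessons as soon as the catalog is rebuilt.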

## Key Files in the catalog.json Pipeline

The following files define the end-to-end lifecycle of the curriculum manifest:

- **[`scripts/build_catalog.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/build_catalog.py)** — Pure-stdlib generator that walks `phases/` and writes [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json).
- **[`scripts/check_readme_counts.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/check_readme_counts.py)** — Validates or updates [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) totals against the manifest.
- **[`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js)** — Node.js build script that converts the manifest into the website’s data layer.
- **[`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json)** — The generated artifact at the repository root that serves as the canonical manifest.

## Summary

- [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) is the single source of truth for the `ai-engineering-from-scratch` curriculum structure.
- **[`scripts/build_catalog.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/build_catalog.py)** generates the file by scanning `phases/` and aggregating lesson metadata, code files, and output artifacts.
- **[`scripts/check_readme_counts.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/check_readme_counts.py)** consumes the manifest to keep [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) statistics accurate.
- **[`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js)** transforms the manifest into [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js) to power the interactive curriculum website.
- The pipeline runs automatically on every push to `main`, ensuring the repository and website stay synchronized.

## Frequently Asked Questions

### What is the purpose of catalog.json in ai-engineering-from-scratch?

[`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) serves as the canonical machine-readable manifest that describes the entire curriculum. It records every phase, lesson, skill, prompt, agent, and code file, enabling downstream scripts to verify documentation counts and render the website without maintaining duplicate data sources.

### How do I regenerate catalog.json after adding a new lesson?

Run `python3 scripts/build_catalog.py` from the repository root. The script walks the updated `phases/` directory, extracts metadata from the new lesson’s `docs/en.md`, `code/`, and `outputs/` folders, and overwrites [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) with the regenerated manifest, including the new totals. To preview the result in the terminal without writing the file, add the `--stdout` flag.

### Why does check_readme_counts.py modify README.md automatically?

The script treats [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) as the ground truth. When it detects a mismatch between the manifest totals and the hard-coded numbers in [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md), it rewrites the README to match the actual curriculum state. This automation prevents stale statistics from being published to the repository.

### Does the curriculum website update automatically when catalog.json changes?

Yes. The website build pipeline regenerates [`catalog.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.json) on every push to `main`, and [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) immediately consumes that fresh manifest to produce [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js). Because this happens in CI, the interactive curriculum site always reflects the latest lesson structure without manual intervention.