How catalog.json Is Generated and Used in AI Engineering from Scratch
catalog.json is the canonical curriculum manifest generated by scripts/build_catalog.py and consumed by both the README validator and the static site builder to keep every lesson, skill, and code artifact in sync across the repository.
The ai-engineering-from-scratch curriculum by Rohit Ghumare relies on a single machine-readable manifest called catalog.json, according to the rohitg00/ai-engineering-from-scratch source code. This JSON file serves as the single source of truth for all phases, lessons, skills, and code artifacts. Understanding how this file is generated and where it is consumed is essential for anyone contributing to or extending the curriculum.
How catalog.json Is Generated
The entire manifest is created by a pure-stdlib Python script located at scripts/build_catalog.py. Because it uses only the Python standard library, the script runs without any external dependencies.
The build_catalog.py Discovery Pipeline
When executed, build_catalog.py performs a deterministic walk through the phases/ directory and assembles a hierarchical dictionary that is ultimately serialized to JSON. The pipeline follows these steps:
- Discover phases: The script scans phases/NN-slug/ directories using a regex to identify every phase in numeric order.
- Iterate lessons: Inside each phase, it locates lesson folders matching the NN-slug/ pattern.
- Extract lesson metadata: For every lesson, it reads the H1 title from docs/en.md (falling back to a slug-derived title), lists code files (.py, .ts, etc.) from code/, parses output artifacts (skill-*.md, prompt-*.md, agent-*.md) from outputs/, and records the presence of docs, quizzes, and notebooks.
- Compute curriculum totals: It aggregates the number of phases, lessons, skills, prompts, agents, and code files across the entire repository.
- Emit the manifest: The resulting dictionary is written to catalog.json in the repository root by default, or printed to the console when the --stdout flag is passed.
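The steps above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration, not the actual contents of scripts/build_catalog.py: the function names (numbered_dirs, lesson_title, describe_lesson, build_catalog), the exact regex, the key names in the returned dictionary, and the restriction to .py and .ts files are all assumptions made for the example. Quiz and notebook detection is left out because the article does not say where those files live.

```python
import re
from pathlib import Path

# Assumed pattern for "NN-slug": a numeric prefix, a hyphen, then the slug.
NN_SLUG = re.compile(r"^(\d+)-(.+)$")
# The article lists ".py, .ts, etc."; only these two are assumed here.
CODE_SUFFIXES = {".py", ".ts"}
ARTIFACT_KINDS = ("skill", "prompt", "agent")


def numbered_dirs(parent: Path) -> list:
    """Return child directories named NN-slug, ordered by numeric prefix."""
    found = []
    if parent.is_dir():
        for child in parent.iterdir():
            match = NN_SLUG.match(child.name)
            if child.is_dir() and match:
                found.append((int(match.group(1)), child.name, child))
    return [child for _, _, child in sorted(found)]


def lesson_title(lesson: Path) -> str:
    """Read the first H1 from docs/en.md, else derive a title from the slug."""
    doc = lesson / "docs" / "en.md"
    if doc.is_file():
        for line in doc.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                return line[2:].strip()
    slug = NN_SLUG.match(lesson.name).group(2)
    return slug.replace("-", " ").title()


def describe_lesson(lesson: Path) -> dict:
    """Collect title, code files, and output artifacts for one lesson."""
    code_dir, out_dir = lesson / "code", lesson / "outputs"
    code = []
    if code_dir.is_dir():
        code = sorted(p.name for p in code_dir.iterdir()
                      if p.is_file() and p.suffix in CODE_SUFFIXES)
    outputs = {}
    for kind in ARTIFACT_KINDS:
        names = out_dir.glob(f"{kind}-*.md") if out_dir.is_dir() else []
        outputs[kind] = sorted(p.name for p in names)
    return {
        "slug": lesson.name,
        "title": lesson_title(lesson),
        "code": code,
        "outputs": outputs,
        "has_docs": (lesson / "docs").is_dir(),
    }


def build_catalog(root: Path) -> dict:
    """Walk root/phases and return a nested dictionary with totals."""
    phases = []
    for phase in numbered_dirs(root / "phases"):
        lessons = [describe_lesson(d) for d in numbered_dirs(phase)]
        phases.append({"slug": phase.name, "lessons": lessons})
    lessons = [lesson for phase in phases for lesson in phase["lessons"]]
    totals = {
        "phases": len(phases),
        "lessons": len(lessons),
        "code_files": sum(len(lesson["code"]) for lesson in lessons),
    }
    for kind in ARTIFACT_KINDS:
        totals[f"{kind}s"] = sum(len(lesson["outputs"][kind]) for lesson in lessons)
    return {"totals": totals, "phases": phases}
```

Sorting on the parsed integer prefix rather than the raw folder name keeps the walk in numeric order even if some prefixes are not zero-padded.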
Regenerating the Catalog Locally
You can regenerate the manifest at any time using the following commands:
# Create or overwrite catalog.json at the repo root
python3 scripts/build_catalog.py
# Preview the JSON on stdout without writing to disk
python3 scripts/build_catalog.py --stdout
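The two invocations above differ only in where the JSON ends up. One plausible way to wire that switch, shown here as a sketch with argparse, is a boolean flag that selects between writing the file and printing. The names emit, main, and SAMPLE_CATALOG are hypothetical, the placeholder data stands in for the real directory walk, and the indent and sort_keys settings are assumptions rather than the script's documented behavior.

```python
import argparse
import json
import sys
from pathlib import Path

# Placeholder manifest used only for this illustration.
SAMPLE_CATALOG = {"totals": {"phases": 0, "lessons": 0}, "phases": []}


def emit(catalog: dict, repo_root: Path, to_stdout: bool) -> str:
    """Serialize the catalog, then either print it or write catalog.json."""
    text = json.dumps(catalog, indent=2, sort_keys=True) + "\n"
    if to_stdout:
        sys.stdout.write(text)
    else:
        (repo_root / "catalog.json").write_text(text, encoding="utf-8")
    return text


def main(argv: list, repo_root: Path) -> int:
    """Parse the command line and emit the (placeholder) catalog."""
    parser = argparse.ArgumentParser(description="Hypothetical catalog emitter")
    parser.add_argument("--stdout", action="store_true",
                        help="print the JSON instead of writing catalog.json")
    args = parser.parse_args(argv)
    emit(SAMPLE_CATALOG, repo_root, args.stdout)
    return 0
```

Calling main([], Path(".")) would write catalog.json into the current directory, while main(["--stdout"], Path(".")) would leave the disk untouched and print the same text.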
Where catalog.json Is Used
Once generated, catalog.json acts as the single source of truth for downstream automation. The repository consumes it in two primary places: README count verification and static site generation.
README Count Verification
The script scripts/check_readme_counts.py reads catalog.json and compares the computed totals against the hard-coded statistics embedded in README.md. If the numbers differ, the script automatically rewrites the README to match the manifest; otherwise it exits with status 0. This ensures that the public-facing documentation never drifts out of sync with the actual curriculum size.
To verify the counts manually, run:
# Exits 0 if README matches the catalog; otherwise updates README.md
python3 scripts/check_readme_counts.py
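The compare-then-rewrite behavior described above can be illustrated with a small regex-based sketch. The article does not show how README.md phrases its statistics, so this example assumes they appear as a plain integer followed by a label, such as "260 lessons". The function name sync_counts and that format are both assumptions, and the real script may locate the numbers differently.

```python
import re


def sync_counts(readme: str, totals: dict[str, int]) -> tuple[str, bool]:
    """Rewrite 'N <label>' occurrences to match totals; report if text changed."""
    updated = readme
    for label, value in totals.items():
        # Match a whole integer that is directly followed by the label word.
        pattern = re.compile(rf"\b\d+(?=\s+{re.escape(label)}\b)")
        updated = pattern.sub(str(value), updated)
    return updated, updated != readme


# Example: the README claims 250 lessons but the manifest counts 260.
text, changed = sync_counts(
    "Covers 20 phases and 250 lessons.",
    {"phases": 20, "lessons": 260},
)
print(text)
print(changed)
```

Returning the changed flag alongside the new text lets a caller decide whether to write the file back or simply exit with status 0.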
Static Site Generation
In the website pipeline, site/build.js loads catalog.json and transforms its contents into site/data.js. That data module powers the interactive curriculum website by supplying phase titles, lesson URLs, and artifact metadata to the front-end renderer. Because the repository regenerates catalog.json on every push to main, the website reflects the latest curriculum state without any manual edits.
To rebuild the site locally, use:
# Internally calls site/build.js, which consumes catalog.json
npm run build
Key Files in the catalog.json Pipeline
The following files define the end-to-end lifecycle of the curriculum manifest:
- scripts/build_catalog.py: Pure-stdlib generator that walks phases/ and writes catalog.json.
- scripts/check_readme_counts.py: Validates or updates README.md totals against the manifest.
- site/build.js: Node.js build script that converts the manifest into the website’s data layer.
- catalog.json: The generated artifact at the repository root that serves as the canonical manifest.
Summary
- catalog.json is the single source of truth for the ai-engineering-from-scratch curriculum structure.
- scripts/build_catalog.py generates the file by scanning phases/ and aggregating lesson metadata, code files, and output artifacts.
- scripts/check_readme_counts.py consumes the manifest to keep README.md statistics accurate.
- site/build.js transforms the manifest into site/data.js to power the interactive curriculum website.
- The pipeline runs automatically on every push to main, ensuring the repository and website stay synchronized.
Frequently Asked Questions
What is the purpose of catalog.json in ai-engineering-from-scratch?
catalog.json serves as the canonical machine-readable manifest that describes the entire curriculum. It records every phase, lesson, skill, prompt, agent, and code file, enabling downstream scripts to verify documentation counts and render the website without maintaining duplicate data sources.
How do I regenerate catalog.json after adding a new lesson?
Run python3 scripts/build_catalog.py from the repository root. The script will walk the updated phases/ directory, extract metadata from the new lesson’s docs/en.md, code/, and outputs/ folders, and overwrite catalog.json with the new totals. You can preview the result on the terminal by adding the --stdout flag.
Why does check_readme_counts.py modify README.md automatically?
The script treats catalog.json as the ground truth. When it detects a mismatch between the manifest totals and the hard-coded numbers in README.md, it rewrites the README to match the actual curriculum state. This automation prevents stale statistics from being published to the repository.
Does the curriculum website update automatically when catalog.json changes?
Yes. The website build pipeline regenerates catalog.json on every push to main, and site/build.js immediately consumes that fresh manifest to produce site/data.js. Because this happens in CI, the interactive curriculum site always reflects the latest lesson structure without manual intervention.