
How `site/data.js` Is Generated in ai-engineering-from-scratch: Build Pipeline Explained

June 5, 2026 rohitg00/ai-engineering-from-scratch ↗

The site/data.js file is produced by the site/build.js Node.js script, which parses core curriculum markdown files, scans lesson directories for artifacts, and serializes the results into three exported arrays: PHASES, GLOSSARY, and ARTIFACTS.

In the rohitg00/ai-engineering-from-scratch repository, the static curriculum site consumes a single auto-generated data module that stays in lock-step with the repository’s markdown sources. Understanding how site/data.js is generated helps contributors trace how curriculum changes surface on the site and how the build pipeline keeps phase listings, glossary definitions, and reusable artifacts consistent.

Overview of the `site/build.js` Pipeline

The entire generation workflow lives inside site/build.js. A GitHub Actions CI step runs node site/build.js on every push, so the emitted JavaScript file stays in sync with the latest curriculum state. The pipeline parses the markdown sources, enriches the lesson data, and serializes the result before updating the downstream marketing pages.

Step 1: Parsing Curriculum Structure and Lesson Status

Reading `ROADMAP.md` with `parseRoadmap()`

The script begins by locating the repository root and building absolute paths to the three primary markdown sources: README.md, ROADMAP.md, and glossary/terms.md. The parseRoadmap() function inside site/build.js opens ROADMAP.md and scans for phase headers (## Phase …) and lesson rows formatted as | 01 | … | ✅ |. It constructs a nested map that links each phase to its lessons and their canonical statuses—complete, in-progress, or planned.
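The roadmap pass can be pictured with a small sketch. This is not the repository's code: the function name parseRoadmapText, the shape of the returned map, and the 🚧 and ⬚ markers are assumptions made for illustration (only ✅ is shown above), and phase headers are assumed to carry a numeric ID.

```javascript
// Hypothetical emoji-to-status mapping. Only the ✅ marker is confirmed;
// the other two markers are placeholders chosen for this sketch.
const STATUS_BY_EMOJI = {
  "✅": "complete",
  "🚧": "in-progress", // assumed marker
  "⬚": "planned", // assumed marker
};

// Builds { [phaseId]: { [lessonNumber]: status } } from roadmap text.
function parseRoadmapText(text) {
  const statuses = {};
  let currentPhase = null;

  for (const line of text.split("\n")) {
    // Phase headers are assumed to look like "## Phase 1: Title".
    const phaseMatch = line.match(/^##\s+Phase\s+(\d+)/);
    if (phaseMatch) {
      currentPhase = Number(phaseMatch[1]);
      statuses[currentPhase] = {};
      continue;
    }

    // Lesson rows look like "| 01 | Lesson name | ✅ |".
    const rowMatch = line.match(/^\|\s*(\d+)\s*\|(.+)\|\s*(\S+)\s*\|\s*$/);
    if (rowMatch && currentPhase !== null) {
      const number = rowMatch[1];
      const emoji = rowMatch[3];
      // Unknown markers default to "planned" (an assumption of this sketch).
      statuses[currentPhase][number] = STATUS_BY_EMOJI[emoji] ?? "planned";
    }
  }
  return statuses;
}
```

Table header and separator rows are skipped naturally because their first cell is not a number.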

Extracting Phases from `README.md` with `parseReadme()`

Next, parseReadme() reads README.md line-by-line to extract every phase’s ID, name, and description, along with its embedded lesson table. For each lesson row, the function captures:

Lesson name and optional external link.
Type (Build or Learn) by resolving badge images.
Language list by converting emoji flags into readable language names.
Status, merged from the status map that parseRoadmap() built from ROADMAP.md. If a lesson links to a GitHub repository but has no explicit roadmap entry, the script falls back to marking it as complete.
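The status merge and its fallback can be sketched for a single lesson row. The function name parseLessonRow, the assumed row shape, and the "planned" default for unlinked lessons are illustrative assumptions; badge and emoji resolution are left out because their exact markup is not shown here.

```javascript
// Parses one README lesson row and resolves its status.
// Assumed row shape: | 01 | [Lesson name](url) | ...other cells... |
function parseLessonRow(line, roadmapStatuses) {
  const cells = line.split("|").slice(1, -1).map((cell) => cell.trim());
  // Header and separator rows do not start with a lesson number.
  if (cells.length < 2 || !/^\d+$/.test(cells[0])) return null;

  const number = cells[0];
  const linkMatch = cells[1].match(/^\[(.+?)\]\((.+?)\)$/);
  const name = linkMatch ? linkMatch[1] : cells[1];
  const url = linkMatch ? linkMatch[2] : null;

  // The roadmap status wins. A GitHub-linked lesson with no roadmap entry
  // falls back to "complete"; the "planned" default is an assumption.
  let status = roadmapStatuses[number];
  if (!status) {
    status = url && url.includes("github.com") ? "complete" : "planned";
  }
  return { number, name, url, status };
}
```

Here roadmapStatuses stands for the per-phase slice of the roadmap map, keyed by lesson number.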

Step 2: Harvesting Glossary Terms and Lesson Artifacts

Collecting Definitions via `parseGlossary()`

Once the curriculum skeleton is in place, parseGlossary() walks glossary/terms.md and extracts each ### Term heading. It pairs every term with its corresponding “What people say” and “What it actually means” lines, returning an array of term objects that becomes the GLOSSARY export.
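A minimal sketch of that walk follows. It assumes each definition line contains its label followed by a colon and optional markdown emphasis; the function name parseGlossaryText and the field names says and means are hypothetical, not taken from the repository.

```javascript
// Collects { term, says, means } objects from glossary markdown text.
function parseGlossaryText(text) {
  const terms = [];
  let current = null;

  for (const line of text.split("\n")) {
    const heading = line.match(/^###\s+(.+)$/);
    if (heading) {
      current = { term: heading[1].trim(), says: "", means: "" };
      terms.push(current);
      continue;
    }
    if (!current) continue; // ignore anything before the first term

    // Drop the label plus any trailing colon, asterisks, or spaces.
    const says = line.match(/What people say[*:\s]*(.*)$/i);
    const means = line.match(/What it actually means[*:\s]*(.*)$/i);
    if (says) current.says = says[1].trim();
    else if (means) current.means = means[1].trim();
  }
  return terms;
}
```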

Discovering Reusable Artifacts with `discoverArtifacts()`

The discoverArtifacts() function recursively scans every lesson’s outputs/ directory across the phases/ tree. It parses front-matter fields—such as name, description, and tags—from each discovered file and creates dedicated entries for missions when it encounters a mission.md. The result is a flat list of reusable skills, prompts, agents, and missions that the site can render independently of lesson pages.
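The recursive scan might look roughly like the sketch below. Several details are assumptions: the helper names, the restriction to .md files, the kind field, the idea that mission.md sits inside the scanned outputs/ tree, and the simple key: value front-matter format (list-valued fields such as tags are kept as raw strings).

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Reads simple `key: value` front matter between leading `---` lines.
function parseFrontMatter(text) {
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  const fields = {};
  if (!match) return fields;
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx > 0) fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}

// Recursively collects markdown files that sit under any outputs/ directory.
function discoverArtifactsIn(rootDir) {
  const artifacts = [];
  const walk = (dir, insideOutputs) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full, insideOutputs || entry.name === "outputs");
      } else if (insideOutputs && entry.name.endsWith(".md")) {
        const meta = parseFrontMatter(fs.readFileSync(full, "utf8"));
        artifacts.push({
          file: path.relative(rootDir, full),
          // mission.md gets its own kind; everything else is generic here.
          kind: entry.name === "mission.md" ? "mission" : "artifact",
          ...meta,
        });
      }
    }
  };
  walk(rootDir, false);
  return artifacts;
}
```

Calling discoverArtifactsIn on the phases/ directory would return one flat array, which matches the flat list described above.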

Step 3: Enriching Lesson Metadata

Extracting Summaries and Keywords with `extractLessonMeta()`

For lessons that specify a URL, extractLessonMeta() reads the lesson’s local docs/en.md file. It treats the leading blockquote as a one-line summary and concatenates every ### heading into a keyword string. These enriched fields are then attached directly to the corresponding lesson objects inside the PHASES array, powering lesson cards and search indexing without manual duplication.
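The extraction rule is simple enough to sketch as a pure function over the file's text. The name extractLessonMetaFromText and the choice of a single space as the keyword separator are assumptions; reading docs/en.md from disk is left to the caller.

```javascript
// Derives { summary, keywords } from a lesson's markdown text.
function extractLessonMetaFromText(markdown) {
  const lines = markdown.split("\n");

  // The first blockquote line becomes the one-line summary.
  const quote = lines.find((line) => line.startsWith(">"));
  const summary = quote ? quote.replace(/^>\s*/, "").trim() : "";

  // Every level-3 heading contributes to the keyword string.
  // Deeper headings (####) do not match because "###" must be
  // followed by whitespace.
  const keywords = lines
    .filter((line) => /^###\s+/.test(line))
    .map((line) => line.replace(/^###\s+/, "").trim())
    .join(" ");

  return { summary, keywords };
}
```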

Step 4: Assembling and Writing `site/data.js`

With parsing complete, the script holds three top-level data structures:

PHASES — phase objects containing nested lessons with status, type, languages, URL, summary, and keywords.
GLOSSARY — term objects with human-readable definitions.
ARTIFACTS — flattened skill, prompt, agent, and mission objects.

It then serializes these arrays into site/data.js, prepending an auto-generated comment header that includes an ISO timestamp and a warning not to edit the file manually. The emitted module follows this pattern:

// Auto-generated by build.js — do not edit manually.
// Last built: <ISO-timestamp>

const PHASES = [ … ];
const GLOSSARY = [ … ];
const ARTIFACTS = [ … ];
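One way to produce a file of this shape is to build the text from JSON.stringify output, which is valid JavaScript literal syntax. The sketch below is an assumption about the approach, not the script's actual code: the function name renderDataModule and the two-space indentation are hypothetical, and the header wording is simplified.

```javascript
// Renders the module text for the three arrays. `builtAt` is injectable
// so the timestamp can be fixed in tests.
function renderDataModule({ phases, glossary, artifacts }, builtAt = new Date()) {
  const toLiteral = (value) => JSON.stringify(value, null, 2);
  return [
    "// Auto-generated by build.js - do not edit manually.",
    `// Last built: ${builtAt.toISOString()}`,
    "",
    `const PHASES = ${toLiteral(phases)};`,
    `const GLOSSARY = ${toLiteral(glossary)};`,
    `const ARTIFACTS = ${toLiteral(artifacts)};`,
    "",
  ].join("\n");
}
```

The returned string could then be written to disk with fs.writeFileSync, for example to a data.js path next to the build script.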

You can regenerate the file locally at any time:

cd ai-engineering-from-scratch
node site/build.js

Running the command prints progress messages and finishes with a confirmation that site/data.js has been updated.

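If the generated file is loaded as an ES module that exports these arrays, site code can import them directly. The pattern above shows plain const declarations, so a page that loads data.js through a script tag would read the same names as globals instead. The import path below is relative to the repository root; adjust it for files inside site/.

import { PHASES, GLOSSARY, ARTIFACTS } from './site/data.js';

// find() returns undefined when no phase matches, so guard before filtering.
const phase = PHASES.find(p => p.id === 3);
const completed = (phase?.lessons ?? []).filter(l => l.status === 'complete');

console.log(completed.map(l => l.name));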

Step 5: Synchronizing Marketing Counts

After writing site/data.js, syncCounts() updates static HTML pages—including index.html and catalog.html—to reflect the current totals for lessons, phases, and artifacts. This keeps marketing copy accurate without requiring hand-edited placeholders after every curriculum update.
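How the counts are located inside the HTML is not described here, so the following is only a sketch under a loud assumption: that each total sits in a span tagged with a hypothetical data-count attribute. The function names countTotals and syncCountsInHtml are likewise invented for illustration.

```javascript
// Computes the three totals from the parsed data.
function countTotals(phases, artifacts) {
  return {
    phases: phases.length,
    lessons: phases.reduce((sum, phase) => sum + phase.lessons.length, 0),
    artifacts: artifacts.length,
  };
}

// Rewrites <span data-count="key">...</span> with the current total.
// The data-count attribute is a hypothetical marker, not the site's markup.
function syncCountsInHtml(html, counts) {
  return html.replace(
    /(<span data-count="(\w+)">)[^<]*(<\/span>)/g,
    (whole, open, key, close) =>
      key in counts ? `${open}${counts[key]}${close}` : whole
  );
}
```

Spans whose key has no matching total are left untouched, so unrelated markup survives the pass.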

Summary

Single script: All generation logic lives in site/build.js and runs automatically via GitHub Actions on every push.
Three parsers: parseRoadmap(), parseReadme(), and parseGlossary() ingest ROADMAP.md, README.md, and glossary/terms.md respectively.
Artifact discovery: discoverArtifacts() recursively harvests front-matter from outputs/ folders and mission.md files.
Lesson enrichment: extractLessonMeta() pulls summaries and keywords from each lesson’s docs/en.md.
Unified output: The script writes site/data.js, exporting PHASES, GLOSSARY, and ARTIFACTS, then calls syncCounts() to refresh static page totals.

Frequently Asked Questions

What triggers the generation of `site/data.js`?

A GitHub Actions workflow invokes node site/build.js on every push to the repository. This automated step regenerates the static site data whenever README.md, ROADMAP.md, glossary/terms.md, or lesson content changes.

Which source files does `site/build.js` parse to build the data module?

The script reads README.md for phase and lesson structure, ROADMAP.md for canonical lesson statuses, and glossary/terms.md for definitions. It also inspects individual lesson files—specifically docs/en.md for metadata and the outputs/ directories for artifacts—across the phases/ tree.

What data structures are exported from the generated `site/data.js`?

The generated file exports three top-level arrays: PHASES (curriculum structure with enriched lesson objects), GLOSSARY (term definitions), and ARTIFACTS (reusable skills, prompts, agents, and missions collected from lesson outputs).

How does the build script determine whether a lesson is marked as complete?

parseReadme() merges the status map produced by parseRoadmap(), which reads status emojis from ROADMAP.md. If a lesson row in README.md contains a GitHub repository link but has no explicit roadmap entry, the script applies a fallback status of complete.

