# How `site/data.js` Is Generated in ai-engineering-from-scratch: Build Pipeline Explained

> Discover how site/data.js is generated in ai-engineering-from-scratch. Learn about the build pipeline that parses markdown and scans lesson directories, exporting PHASES, GLOSSARY, and ARTIFACTS.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: internals
- Published: 2026-06-05

---

**The [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js) file is produced by the [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) Node.js script, which parses core curriculum markdown files, scans lesson directories for artifacts, and serializes the results into three exported arrays: `PHASES`, `GLOSSARY`, and `ARTIFACTS`.**

In the `rohitg00/ai-engineering-from-scratch` repository, the static curriculum site consumes a single auto-generated data module that stays in lock-step with the repository's markdown sources. Understanding how [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js) is generated helps contributors trace how curriculum changes surface on the site and how the build pipeline keeps phase listings, glossary definitions, and reusable artifacts consistent.

## Overview of the [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) Pipeline

The entire generation workflow lives inside [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js). According to the source code, the script is invoked automatically by a GitHub Actions CI step—`node site/build.js`—on every push, ensuring the emitted JavaScript file never drifts from the latest curriculum state. The pipeline proceeds through a sequence of parsing, enrichment, and serialization stages before updating downstream marketing pages.

## Step 1: Parsing Curriculum Structure and Lesson Status

### Reading [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md) with `parseRoadmap()`

The script begins by locating the repository root and building absolute paths to the three primary markdown sources: [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md), [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md), and [`glossary/terms.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/glossary/terms.md). The `parseRoadmap()` function inside [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) opens [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md) and scans for phase headers (`## Phase …`) and lesson rows formatted as `| 01 | … | ✅ |`. It constructs a nested map that links each phase to its lessons and their canonical statuses—**complete**, **in-progress**, or **planned**.
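The scan described above can be sketched as follows. This is a minimal illustration, not the code in `site/build.js`: the function name `parseRoadmapSketch`, the regular expressions, the choice to key lessons by their row number, and every emoji other than `✅` are assumptions made for the example.

```js
// Hypothetical emoji-to-status mapping; only the check mark appears in the article.
const STATUS_BY_EMOJI = { '✅': 'complete', '🚧': 'in-progress', '⬚': 'planned' };

// Builds { phaseNumber: { lessonNumber: status } } from roadmap-style markdown.
function parseRoadmapSketch(markdown) {
  const phases = {};
  let current = null;
  for (const line of markdown.split('\n')) {
    // Phase header, e.g. "## Phase 1: Setup"
    const header = line.match(/^## Phase (\d+)/);
    if (header) {
      current = Number(header[1]);
      phases[current] = {};
      continue;
    }
    // Lesson row, e.g. "| 01 | Dev Environment | ✅ |"
    const row = line.match(/^\|\s*(\d+)\s*\|\s*(.+?)\s*\|\s*(\S+)\s*\|\s*$/);
    if (row && current !== null) {
      // Unknown markers default to "planned" (an assumption of this sketch).
      phases[current][row[1]] = STATUS_BY_EMOJI[row[3]] ?? 'planned';
    }
  }
  return phases;
}
```

Table header and separator rows are skipped naturally because their first cell is not numeric.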

### Extracting Phases from [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) with `parseReadme()`

Next, `parseReadme()` reads [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) line-by-line to extract every phase’s ID, name, and description, along with its embedded lesson table. For each lesson row, the function captures:

- **Lesson name** and optional external link.
- **Type** (`Build` or `Learn`) by resolving badge images.
- **Language list** by converting emoji flags into readable language names.
- **Status**, merged from the roadmap map built earlier. If a lesson links to a GitHub repository but lacks an explicit roadmap entry, the script falls back to marking it as **complete**.
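The link extraction and status merge in the list above might look roughly like this. The helper names `parseLessonCell` and `resolveStatus`, the `name`/`url` field names, keying the roadmap map by lesson name, and the final `'planned'` default are all assumptions for illustration; only the "linked lesson without a roadmap entry counts as complete" rule comes from the description.

```js
// Splits a markdown table cell such as "[Tokenizers](https://github.com/...)"
// into a name and an optional link.
function parseLessonCell(cell) {
  const link = cell.match(/\[([^\]]+)\]\(([^)]+)\)/);
  return link
    ? { name: link[1], url: link[2] }
    : { name: cell.trim(), url: null };
}

// Prefers the roadmap status; otherwise applies the GitHub-link fallback.
function resolveStatus(lesson, roadmapStatuses) {
  const fromRoadmap = roadmapStatuses[lesson.name];
  if (fromRoadmap) return fromRoadmap;
  // Fallback from the text: a GitHub-linked lesson with no roadmap entry is complete.
  if (lesson.url && lesson.url.includes('github.com')) return 'complete';
  // Default for unlinked, unlisted lessons (assumed, not confirmed by the source).
  return 'planned';
}
```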

## Step 2: Harvesting Glossary Terms and Lesson Artifacts

### Collecting Definitions via `parseGlossary()`

Once the curriculum skeleton is in place, `parseGlossary()` walks [`glossary/terms.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/glossary/terms.md) and extracts each `### Term` heading. It pairs every term with its corresponding “What people say” and “What it actually means” lines, returning an array of term objects that becomes the `GLOSSARY` export.
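As a rough sketch of that pairing, assuming each definition sits on a single line whose label may be wrapped in bold markers (the exact line format in `glossary/terms.md` is not shown here, and the `term`/`says`/`means` field names are hypothetical):

```js
// Collects { term, says, means } objects from glossary-style markdown.
function parseGlossarySketch(markdown) {
  const terms = [];
  let current = null;
  for (const line of markdown.split('\n')) {
    const heading = line.match(/^### (.+)$/);
    if (heading) {
      current = { term: heading[1].trim(), says: '', means: '' };
      terms.push(current);
      continue;
    }
    if (!current) continue; // ignore anything before the first term
    // Strip the label plus any trailing colon, asterisks, or spaces.
    const says = line.match(/What people say[:*\s]*(.*)$/);
    const means = line.match(/What it actually means[:*\s]*(.*)$/);
    if (says) current.says = says[1].trim();
    if (means) current.means = means[1].trim();
  }
  return terms;
}
```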

### Discovering Reusable Artifacts with `discoverArtifacts()`

The `discoverArtifacts()` function recursively scans every lesson’s `outputs/` directory across the `phases/` tree. It parses front-matter fields—such as `name`, `description`, and `tags`—from each discovered file and creates dedicated entries for missions when it encounters a [`mission.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/mission.md). The result is a flat list of reusable **skills**, **prompts**, **agents**, and **missions** that the site can render independently of lesson pages.
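The front-matter step can be illustrated with a small parser. This sketch assumes a `---`-delimited block of simple `key: value` lines with `tags` written as a bracketed or comma-separated list; the real files may use richer YAML, and the function name `parseFrontMatterSketch` is hypothetical. The recursive directory walk is omitted.

```js
// Extracts name, description, and tags from a "---" delimited front-matter block.
// Returns null when the text has no front matter.
function parseFrontMatterSketch(text) {
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return null;

  const fields = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1) continue; // skip lines that are not key: value pairs
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }

  // Accept "tags: [a, b]" or "tags: a, b".
  const tags = (fields.tags || '')
    .replace(/^\[|\]$/g, '')
    .split(',')
    .map(t => t.trim())
    .filter(Boolean);

  return {
    name: fields.name || '',
    description: fields.description || '',
    tags,
  };
}
```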

## Step 3: Enriching Lesson Metadata

### Extracting Summaries and Keywords with `extractLessonMeta()`

For lessons that specify a URL, `extractLessonMeta()` reads the lesson’s local [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file. It treats the leading blockquote as a one-line summary and concatenates every `### ` heading into a keyword string. These enriched fields are then attached directly to the corresponding lesson objects inside the `PHASES` array, powering lesson cards and search indexing without manual duplication.
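A minimal sketch of that extraction, operating on the file contents as a string. The function name `extractLessonMetaSketch`, the `summary`/`keywords` field names, and the use of a single space as the keyword separator are assumptions, as is treating the first `>` line as the leading blockquote.

```js
// Derives a one-line summary and a keyword string from lesson markdown.
function extractLessonMetaSketch(markdown) {
  const lines = markdown.split('\n');

  // First blockquote line becomes the summary.
  const quote = lines.find(l => l.startsWith('>'));
  const summary = quote ? quote.replace(/^>\s*/, '').trim() : '';

  // Every "### " heading contributes its text to the keyword string.
  const keywords = lines
    .filter(l => l.startsWith('### '))
    .map(l => l.slice(4).trim())
    .join(' ');

  return { summary, keywords };
}
```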

## Step 4: Assembling and Writing [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js)

With parsing complete, the script holds three top-level data structures:

- **`PHASES`** — phase objects containing nested lessons with status, type, languages, URL, summary, and keywords.
- **`GLOSSARY`** — term objects with human-readable definitions.
- **`ARTIFACTS`** — flattened skill, prompt, agent, and mission objects.

It then serializes these arrays into [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js), prepending an auto-generated comment header that includes an ISO timestamp and a warning not to edit the file manually. The emitted module follows this pattern:

```js
// Auto-generated by build.js — do not edit manually.
// Last built: <ISO-timestamp>

const PHASES = [ … ];
const GLOSSARY = [ … ];
const ARTIFACTS = [ … ];
```
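One way to produce a file in that shape is to serialize each array with `JSON.stringify` and join the pieces. The sketch below is an assumption about the approach, not the script's actual code; `renderDataModule` and its parameters are hypothetical names, and the header wording is abbreviated.

```js
// Renders the text of a data module from three arrays.
// builtAt is injectable so the output is reproducible in tests.
function renderDataModule({ phases, glossary, artifacts }, builtAt = new Date()) {
  const header = [
    '// Auto-generated by build.js - do not edit manually.',
    `// Last built: ${builtAt.toISOString()}`,
  ];
  const body = [
    `const PHASES = ${JSON.stringify(phases, null, 2)};`,
    `const GLOSSARY = ${JSON.stringify(glossary, null, 2)};`,
    `const ARTIFACTS = ${JSON.stringify(artifacts, null, 2)};`,
  ];
  // Blank line between header and body, trailing newline at end of file.
  return [...header, '', ...body, ''].join('\n');
}
```

The returned string could then be written to disk with Node's `fs.writeFileSync`.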

You can regenerate the file locally at any time:

```bash
cd ai-engineering-from-scratch
node site/build.js
```

Running the command prints progress messages and finishes with a confirmation that [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js) has been updated.

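If the generated file is served as an ES module that exports these names, site code can import it directly. Note that the pattern above shows plain `const` declarations; if the file has no `export` statements, load it with a `<script>` tag and read `PHASES`, `GLOSSARY`, and `ARTIFACTS` as globals instead.

```js
// Assumes data.js exposes ES module exports; adjust the path to the importing file's location.
import { PHASES, GLOSSARY, ARTIFACTS } from './site/data.js';

// Find phase 3 and keep only its completed lessons.
// Optional chaining avoids a TypeError when no phase has id 3.
const completed = (PHASES.find(p => p.id === 3)?.lessons ?? [])
  .filter(l => l.status === 'complete');

console.log(completed.map(l => l.name));
```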

## Step 5: Synchronizing Marketing Counts

After writing [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js), `syncCounts()` updates static HTML pages—including [`index.html`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/index.html) and [`catalog.html`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/catalog.html)—to reflect the current totals for lessons, phases, and artifacts. This keeps marketing copy accurate without requiring hand-edited placeholders after every curriculum update.
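How the totals are located inside the HTML is not described here, so the following is only a sketch under a stated assumption: each number lives in an element carrying a hypothetical `data-count="..."` attribute. The names `countTotals` and `syncCountsSketch` are likewise illustrative.

```js
// Computes the totals that marketing pages display.
function countTotals(phases, artifacts) {
  return {
    phases: phases.length,
    lessons: phases.reduce((n, p) => n + p.lessons.length, 0),
    artifacts: artifacts.length,
  };
}

// Replaces the text of every element marked data-count="<key>" with counts[key].
// Elements whose key is not present in counts are left untouched.
function syncCountsSketch(html, counts) {
  return html.replace(
    /(<[^>]*\bdata-count="(\w+)"[^>]*>)([^<]*)(<)/g,
    (whole, open, key, _old, close) =>
      Object.prototype.hasOwnProperty.call(counts, key)
        ? `${open}${counts[key]}${close}`
        : whole
  );
}
```

Keeping the count calculation separate from the HTML rewrite means the same totals can be applied to several pages in turn.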

## Summary

- **Single script**: All generation logic lives in [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) and runs automatically via GitHub Actions on every push.
- **Three parsers**: `parseRoadmap()`, `parseReadme()`, and `parseGlossary()` ingest [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md), [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md), and [`glossary/terms.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/glossary/terms.md) respectively.
- **Artifact discovery**: `discoverArtifacts()` recursively harvests front-matter from `outputs/` folders and [`mission.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/mission.md) files.
- **Lesson enrichment**: `extractLessonMeta()` pulls summaries and keywords from each lesson’s [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md).
- **Unified output**: The script writes [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js), exporting `PHASES`, `GLOSSARY`, and `ARTIFACTS`, then calls `syncCounts()` to refresh static page totals.

## Frequently Asked Questions

### What triggers the generation of [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js)?

A GitHub Actions workflow invokes `node site/build.js` on every push to the repository. This automated step ensures that changes to [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md), [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md), or lesson content are reflected in the static site data after each push.

### Which source files does [`site/build.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/build.js) parse to build the data module?

The script reads [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) for phase and lesson structure, [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md) for canonical lesson statuses, and [`glossary/terms.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/glossary/terms.md) for definitions. It also inspects individual lesson files—specifically [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) for metadata and the `outputs/` directories for artifacts—across the `phases/` tree.

### What data structures are exported from the generated [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js)?

The generated file exports three top-level arrays: **`PHASES`** (curriculum structure with enriched lesson objects), **`GLOSSARY`** (term definitions), and **`ARTIFACTS`** (reusable skills, prompts, agents, and missions collected from lesson outputs).

### How does the build script determine whether a lesson is marked as complete?

`parseReadme()` merges the status map produced by `parseRoadmap()`, which reads status emojis from [`ROADMAP.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/ROADMAP.md). If a lesson row in [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) contains a GitHub repository link but has no explicit roadmap entry, the script applies a fallback status of **complete**.