
How site/build.js Generates the data.js File in ai-engineering-from-scratch

June 7, 2026 · rohitg00/ai-engineering-from-scratch

The site/build.js script transforms the repository's markdown curriculum into a structured JavaScript module by parsing README.md and ROADMAP.md, enriching content with metadata from lesson directories, and exporting three constants to site/data.js for client-side consumption.

The rohitg00/ai-engineering-from-scratch repository automates its website data generation through a Node.js build pipeline. Understanding how site/build.js generates the data.js file allows contributors to modify curriculum structures while ensuring the frontend remains synchronized with the source markdown.

Step 1: Loading Curriculum Source Files

The build process begins by reading three primary markdown sources from the repository. According to the source code in site/build.js lines 4–9 and 20–28, the script uses fs.readFileSync() to load:

README.md – Contains the human-readable phase and lesson tables
ROADMAP.md – Tracks lesson completion status via emoji indicators
glossary/terms.md – Stores definitions for curriculum terminology
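The loading step can be sketched as follows. This is a minimal illustration rather than the code in site/build.js: the loadSources helper and its root parameter are hypothetical, and only the three file paths and the use of fs.readFileSync() come from the description above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: read the three markdown sources as UTF-8 strings.
// `root` is assumed to be the repository root directory.
function loadSources(root) {
  const read = (rel) => fs.readFileSync(path.join(root, rel), "utf8");
  return {
    readme: read("README.md"),
    roadmap: read("ROADMAP.md"),
    glossary: read(path.join("glossary", "terms.md")),
  };
}
```

Because readFileSync throws when a file is missing, a sketch like this fails early if any of the three sources is absent.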

Step 2: Parsing Markdown Structures

After loading the raw content, site/build.js employs three specialized parser functions to extract structured data from the markdown sources.

Extracting Lesson Status from ROADMAP.md

The parseRoadmap() function (lines 30–61) scans ROADMAP.md and extracts lesson-status emojis, returning a plain JavaScript object that maps lessons to their completion states.
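A rough sketch of this kind of status extraction is shown below. The row layout and the emoji-to-status mapping are assumptions made for illustration, since the article does not show the actual ROADMAP.md format; parseRoadmapSketch is a hypothetical name, not the function in site/build.js.

```javascript
// Assumed emoji-to-status mapping; the real ROADMAP.md may use other symbols.
const STATUS_BY_EMOJI = new Map([
  ["✅", "complete"],
  ["🚧", "in-progress"],
  ["⬜", "planned"],
]);

// Hypothetical parser: expects table rows such as "| Lesson name | ✅ |".
function parseRoadmapSketch(markdown) {
  const statuses = {};
  for (const line of markdown.split("\n")) {
    const cells = line.split("|").map((c) => c.trim()).filter(Boolean);
    if (cells.length < 2) continue;
    const [name, marker] = cells;
    const status = STATUS_BY_EMOJI.get(marker);
    // Header and separator rows carry no known marker and are skipped.
    if (status) statuses[name] = status;
  }
  return statuses;
}
```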

Processing Phase Tables in README.md

The parseReadme() function (lines 63–131) walks through the phase tables in README.md to pull lesson names, GitHub links, content types, and language badges. During this process, it converts emoji badges to human-readable strings, such as translating 🐍 to "Python" (lines 48–66), and matches each lesson against the roadmap status map. The function also applies a status override: if a lesson has a valid URL but the roadmap marks it as planned, the script forces the status to complete (lines 96–99).
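The badge translation and the status rule can be illustrated with two small helpers. Both decodeBadges and resolveStatus are hypothetical names, and defaulting a missing roadmap entry to "planned" is an assumption; only the 🐍 to "Python" pair and the planned-to-complete rule are taken from the description above.

```javascript
// Only the 🐍 → "Python" pair is stated in the article; other badges are unknown here.
const LANGUAGE_BY_BADGE = new Map([["🐍", "Python"]]);

// Hypothetical helper: turn a badge cell like "🐍" into ["Python"].
// Unrecognized badges pass through unchanged.
function decodeBadges(cell) {
  return cell
    .split(/\s+/)
    .filter(Boolean)
    .map((badge) => LANGUAGE_BY_BADGE.get(badge) ?? badge);
}

// A linked lesson that the roadmap still lists as planned is treated as complete.
// Assumption: a lesson missing from the roadmap map starts out as "planned".
function resolveStatus(url, roadmapStatus) {
  const status = roadmapStatus ?? "planned";
  return url && status === "planned" ? "complete" : status;
}
```

The rule only ever upgrades "planned"; any other roadmap status is returned as-is, and a lesson without a URL keeps whatever the roadmap says.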

Building the Glossary Index

The parseGlossary() function (lines 71–90) processes glossary/terms.md to extract term definitions, returning a plain object that becomes the GLOSSARY export.
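As an illustration of turning a terms file into a plain object, the sketch below assumes each term is a "### Term" heading followed by its definition lines. That layout is a guess, since the article does not show glossary/terms.md, and parseGlossarySketch is a hypothetical name.

```javascript
// Hypothetical parser: assumes "### Term" headings followed by definition lines.
// The real glossary/terms.md layout may differ.
function parseGlossarySketch(markdown) {
  const glossary = {};
  let term = null;
  for (const line of markdown.split("\n")) {
    const heading = line.match(/^###\s+(.+)$/);
    if (heading) {
      term = heading[1].trim();
      glossary[term] = "";
    } else if (line.startsWith("#")) {
      term = null; // any other heading level ends the current definition
    } else if (term && line.trim()) {
      // Join multi-line definitions into a single space-separated string.
      glossary[term] = (glossary[term] + " " + line.trim()).trim();
    }
  }
  return glossary;
}
```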

Step 3: Artifact Discovery and Metadata Enrichment

Before writing the output, the script enriches lesson objects with metadata from the filesystem and lesson documentation.

Scanning Output Directories

The discoverArtifacts() function (lines 112–198) walks the phases/*/*/outputs/ directories to gather deliverables. It collects files prefixed with skill-, prompt-, or agent-, along with phase-14 mission files, parsing each file's frontmatter using parseFrontmatter() to extract structured metadata.
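The two core checks in this step, filename matching and frontmatter parsing, might look like the sketch below. parseFrontmatter() is named in the article, but its behavior is not described, so parseFrontmatterSketch is a hypothetical stand-in that handles only flat "key: value" pairs; the directory walk and the phase-14 mission files are left out.

```javascript
// Hypothetical frontmatter reader: handles flat "key: value" pairs between
// two "---" lines. Nested YAML is out of scope for this sketch.
function parseFrontmatterSketch(text) {
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return {};
  const meta = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    // Split on the first colon only, so values may contain colons.
    meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return meta;
}

// The three prefixes come from the article.
const isArtifactFile = (name) => /^(skill|prompt|agent)-/.test(name);
```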

Extracting Lesson Summaries and Keywords

The extractLessonMeta() function (lines 47–68) reads each lesson's docs/en.md file (when present) to extract a one-line summary and harvest all ### headings as searchable keywords. This metadata enables client-side search functionality in the generated website.
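A minimal sketch of this extraction is shown below. How the real function picks the summary is not stated, so the sketch assumes it is the first non-empty line that is not a heading; extractLessonMetaSketch is a hypothetical name.

```javascript
// Hypothetical extraction: the summary is assumed to be the first non-empty,
// non-heading line; keywords are the texts of all "### " headings.
function extractLessonMetaSketch(markdown) {
  const lines = markdown.split(/\r?\n/).map((l) => l.trim());
  const summary = lines.find((l) => l && !l.startsWith("#")) ?? "";
  const keywords = lines
    .filter((l) => l.startsWith("### "))
    .map((l) => l.slice(4).trim());
  return { summary, keywords };
}
```

In this sketch the keywords are returned as an array of strings, one per heading, which is the shape a client-side search would iterate over.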

The lessonPath() utility (lines 23–28) transforms GitHub URLs into site-compatible paths by stripping the base URL, yielding paths formatted as /lesson.html?path=… for frontend routing.
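The URL rewrite can be sketched as follows. The base URL, the use of encodeURIComponent, and the null return for non-repository links are all assumptions made for illustration; the article does not show the exact prefix that lessonPath() strips or how it formats the path value.

```javascript
// Assumed base URL; the real prefix stripped by site/build.js may differ.
const ASSUMED_BASE =
  "https://github.com/rohitg00/ai-engineering-from-scratch/tree/main/";

// Hypothetical version of lessonPath(): strip the base, then wrap the remainder.
function lessonPathSketch(url, base = ASSUMED_BASE) {
  if (!url || !url.startsWith(base)) return null; // not a repository link
  const relative = url.slice(base.length).replace(/\/+$/, ""); // drop trailing slashes
  return "/lesson.html?path=" + encodeURIComponent(relative);
}
```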

Step 4: Emitting the JavaScript Module

The final stage assembles the collected data into a JavaScript module. In lines 49–59, the script constructs a string containing three exported constants: PHASES, GLOSSARY, and ARTIFACTS. This string includes a header comment with a generation timestamp. The script then writes this content to site/data.js (lines 60–62), making the curriculum data available for import by the website's client code.
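The emit step can be approximated by a small renderer like the one below. The header wording, the JSON.stringify-based serialization, and the renderDataModule name are assumptions; the article only states that the module has a timestamped header comment and exports PHASES, GLOSSARY, and ARTIFACTS.

```javascript
// Hypothetical renderer: serializes the three collections as ES module exports
// under a header comment carrying the generation time.
function renderDataModule({ phases, glossary, artifacts }, now = new Date()) {
  const header = `// Generated by site/build.js on ${now.toISOString()}. Do not edit by hand.`;
  const decl = (name, value) =>
    `export const ${name} = ${JSON.stringify(value, null, 2)};`;
  return [
    header,
    decl("PHASES", phases),
    decl("GLOSSARY", glossary),
    decl("ARTIFACTS", artifacts),
  ].join("\n\n") + "\n";
}
```

Passing the resulting string to fs.writeFileSync() with the site/data.js path would complete the step. Keeping the renderer free of file I/O makes it easy to check the output without touching the disk.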

Running the Build Locally

To regenerate site/data.js outside of GitHub Actions:

cd /path/to/ai-engineering-from-scratch
node site/build.js

This command prints progress messages and creates the site/data.js file containing the exported data structures.

To inspect the generated module:

// Import in a Node script or browser environment
import { PHASES, GLOSSARY, ARTIFACTS } from './site/data.js';

// Example: list all completed lessons in Phase 3
// (optional chaining avoids a TypeError if no phase has id 3)
const completed = (PHASES.find(p => p.id === 3)?.lessons ?? [])
  .filter(l => l.status === 'complete')
  .map(l => l.name);
console.log('Completed Phase-3 lessons:', completed);

You can also use the exported data to implement search:

function searchLessons(keyword) {
  const lower = keyword.toLowerCase();
  return PHASES.flatMap(p =>
    p.lessons.filter(l =>
      (l.summary && l.summary.toLowerCase().includes(lower)) ||
      // keywords are harvested from ### headings, so accept either a
      // single string or an array of strings
      [].concat(l.keywords || []).some(k => k.toLowerCase().includes(lower))
    )
  );
}

Summary

site/build.js is the single generator of curriculum data in the rohitg00/ai-engineering-from-scratch repository; the markdown files remain the source of truth.
The script parses README.md, ROADMAP.md, and glossary/terms.md to construct lesson structures, status maps, and term definitions.
Artifact discovery scans phases/*/*/outputs/ directories to collect skill, prompt, and agent files, while metadata extraction harvests summaries from lesson documentation.
Generated output includes three exported constants—PHASES, GLOSSARY, and ARTIFACTS—written to site/data.js with a generation timestamp.
Apart from the generation timestamp, re-running the build on unchanged sources produces the same output, and GitHub Actions runs it automatically on every push, ensuring the website always reflects the current curriculum state.

Frequently Asked Questions

What input files does site/build.js require to generate data.js?

The script requires three markdown files: README.md (for phase and lesson tables) and ROADMAP.md (for completion status emojis) at the repository root, plus glossary/terms.md (for term definitions). It also dynamically scans the phases/ directory tree to discover lesson artifacts and metadata.

How does site/build.js determine if a lesson is complete?

The script cross-references the roadmap status with the presence of lesson content. According to lines 96–99, if a lesson has a valid GitHub URL (indicating content exists) but the roadmap marks it as planned, site/build.js overrides the status to complete.

Can I run site/build.js locally without GitHub Actions?

Yes. You can execute node site/build.js from the repository root on any system with Node.js installed. This regenerates site/data.js using the current state of the markdown sources and directory structure, making it useful for local development and testing.

What data structures does the generated site/data.js export?

The generated file exports three constants: PHASES (an array of phase objects containing lessons with metadata), GLOSSARY (an object mapping terms to definitions), and ARTIFACTS (an array of discovered skill, prompt, and agent files with parsed frontmatter). These constants are used by the website's client-side JavaScript to render the curriculum interface.
