How `data.js` Is Generated from `README.md` and `ROADMAP.md` in AI Engineering from Scratch

June 8, 2026 · rohitg00/ai-engineering-from-scratch

The site/build.js script generates site/data.js by parsing README.md and ROADMAP.md into structured JavaScript objects, cross-referencing completion status, and enriching each lesson with metadata pulled from its docs/en.md file.

In the rohitg00/ai-engineering-from-scratch repository, the curriculum website is powered by a fully automated build pipeline. Rather than maintaining JSON or JavaScript data files by hand, the project generates data.js from README.md and ROADMAP.md every time the source changes. This keeps the human-readable markdown files as the single source of truth while producing a structured payload that the front-end consumes directly.

The Role of `site/build.js`

The entire pipeline lives in site/build.js in rohitg00/ai-engineering-from-scratch. This script is the single build entry point: it turns the human-written curriculum files, which remain the source of truth, into the JavaScript data model consumed by the website. It runs automatically on every push via GitHub Actions and can be invoked locally with node site/build.js.

Step 1: Reading the Source Files

At the start of the build, the script loads the three core markdown files into memory.

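The relevant lines (cited as L4-L6 in the source) read:

const readme   = fs.readFileSync(README_PATH, 'utf8');   // README.md: public overview and lesson tables
const roadmap  = fs.readFileSync(ROADMAP_PATH, 'utf8');  // ROADMAP.md: phase and lesson status matrix
const glossary = fs.readFileSync(GLOSSARY_PATH, 'utf8'); // glossary/terms.md: glossary definitions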

The script reads README.md for the public overview and lesson tables, ROADMAP.md for the phase and lesson status matrix, and glossary/terms.md for the glossary definitions.

Step 2: Parsing `ROADMAP.md` with `parseRoadmap`

To determine completion status, the build script calls parseRoadmap, which scans each line of ROADMAP.md for phase headers and individual lesson rows.

As implemented in the file at lines 30-61, the function identifies patterns such as ## Phase 0 … — ✅ for phases and | 01 | Dev Environment | ✅ | for lessons. It then builds a nested map with the structure:


{ Phase → { phaseStatus, lessons: { lessonName → status } } }

This status map becomes the authority for whether a lesson is marked complete, in-progress, or planned.
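
The status-map idea can be illustrated with a minimal sketch. This is not the repository's parseRoadmap: the function name `parseRoadmapSketch`, the regular expressions, and the emoji for the in-progress and planned states are assumptions made for illustration (only ✅ appears in the patterns quoted above).

```javascript
// Hypothetical status table: only the check mark is confirmed by the article;
// the other two emoji are placeholders chosen for this sketch.
const STATUS_BY_EMOJI = { '✅': 'complete', '🚧': 'in-progress', '⬚': 'planned' };

// Return the status for the first known emoji found in the text.
function detectStatus(text) {
  for (const [emoji, status] of Object.entries(STATUS_BY_EMOJI)) {
    if (text.includes(emoji)) return status;
  }
  return 'planned';
}

// Build { phase -> { phaseStatus, lessons: { lessonName -> status } } }.
function parseRoadmapSketch(markdown) {
  const statusMap = {};
  let currentPhase = null;

  for (const line of markdown.split('\n')) {
    // Phase header, e.g. "## Phase 0 ... ✅"
    const phaseMatch = line.match(/^##\s+(Phase\s+\d+)/);
    if (phaseMatch) {
      currentPhase = phaseMatch[1];
      statusMap[currentPhase] = { phaseStatus: detectStatus(line), lessons: {} };
      continue;
    }

    // Lesson row, e.g. "| 01 | Dev Environment | ✅ |"
    const rowMatch = line.match(/^\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|/);
    if (rowMatch && currentPhase) {
      statusMap[currentPhase].lessons[rowMatch[2]] = detectStatus(rowMatch[3]);
    }
  }
  return statusMap;
}

// Sample input written for this sketch, not copied from ROADMAP.md.
const sampleRoadmap = [
  '## Phase 0: Setup \u2014 ✅',
  '| # | Lesson | Status |',
  '|---|--------|--------|',
  '| 01 | Dev Environment | ✅ |',
  '| 02 | Git Basics | 🚧 |',
].join('\n');

console.log(JSON.stringify(parseRoadmapSketch(sampleRoadmap), null, 2));
```

Keying lessons by name keeps the later lookup simple, at the cost of requiring lesson names to match exactly between the two markdown files.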

Step 3: Parsing `README.md` with `parseReadme`

Next, the function parseReadme walks README.md line-by-line to discover phases and lessons. As implemented at lines 63-130, it locates phase headings—supporting both the legacy table-based format and the newer <details> blocks—and then parses the lesson tables that follow them.

For every lesson row, parseReadme extracts:

The lesson name and optional link.
The lesson type (e.g., "Build", "Learn").
The language list via emoji-to-language conversion.
The GitHub URL to the lesson source, if a link is present.

Crucially, it cross-references the status map produced by parseRoadmap to attach a status field to each lesson. If no match is found, the status falls back to "planned".
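
To make the row handling concrete, here is a hedged sketch of parsing a single lesson row and applying the "planned" fallback. The column order (number, name, type, languages), the emoji-to-language table, and the helper name `parseLessonRowSketch` are all assumptions for illustration; the real parseReadme may differ in each of them.

```javascript
// Hypothetical emoji-to-language table; the real mapping may differ.
const LANGUAGE_BY_EMOJI = { '🐍': 'Python', '🦀': 'Rust', '🟦': 'TypeScript' };

// Parse one table row, assuming the layout "| 01 | name-or-link | type | emoji |".
// lessonStatuses is a plain { lessonName: status } object from the roadmap step.
function parseLessonRowSketch(row, lessonStatuses) {
  // Split into trimmed cells, dropping the empty strings outside the edge pipes.
  const cells = row.trim().split('|').slice(1, -1).map((cell) => cell.trim());
  // Skip header and separator rows: the first cell must be a lesson number.
  if (cells.length < 4 || !/^\d+$/.test(cells[0])) return null;

  const [, nameCell, type, langCell] = cells;

  // A linked name looks like "[Dev Environment](some/path/)".
  const link = nameCell.match(/^\[([^\]]+)\]\(([^)]+)\)$/);
  const name = link ? link[1] : nameCell;
  const url = link ? link[2] : null;

  const languages = Object.entries(LANGUAGE_BY_EMOJI)
    .filter(([emoji]) => langCell.includes(emoji))
    .map(([, language]) => language);

  // Fall back to "planned" when the roadmap has no entry for this lesson.
  const status = (lessonStatuses && lessonStatuses[name]) || 'planned';
  return { name, url, type, languages, status };
}

const statuses = { 'Dev Environment': 'complete' };
console.log(parseLessonRowSketch('| 01 | [Dev Environment](phases/00/01/) | Build | 🐍 |', statuses));
console.log(parseLessonRowSketch('| 02 | Git Basics | Learn | 🐍🦀 |', statuses));
```

The sketch keeps the raw link target as `url`; turning it into a full GitHub URL is left out because the article does not describe how that is done.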

Step 4: Enriching Lessons with `extractLessonMeta`

After the curriculum structure is known, the build loop at lines 22-30 of the build function calls extractLessonMeta for every lesson that has a source URL. This helper opens the lesson's docs/en.md file and pulls two pieces of metadata:

The first blockquote, which is used as a one-sentence summary.
All H3 headings, which are concatenated into a keyword string.

These values are added to the lesson object as summary and keywords, making the generated data.js searchable without manual indexing.
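
The extraction itself can be sketched in a few lines. The version below works on a markdown string rather than a file path so it stays self-contained; the name `extractMetaSketch`, the single-line treatment of the blockquote, and the space separator for keywords are assumptions, not details confirmed by the source.

```javascript
// Pull a summary (first blockquote line) and keywords (all H3 headings)
// out of a lesson's markdown text.
function extractMetaSketch(markdown) {
  const lines = markdown.split('\n');

  // First line starting with ">" is treated as the one-sentence summary.
  const quoteLine = lines.find((line) => line.startsWith('>'));
  const summary = quoteLine ? quoteLine.replace(/^>\s*/, '').trim() : '';

  // "### Heading" lines only: "####" does not match because "###" must be
  // followed by whitespace.
  const keywords = lines
    .filter((line) => /^###\s+/.test(line))
    .map((line) => line.replace(/^###\s+/, '').trim())
    .join(' ');

  return { summary, keywords };
}

// Sample lesson text written for this sketch.
const sampleLesson = [
  '# Dev Environment',
  '',
  '> Set up your tools once.',
  '',
  '### Install Node',
  'Some prose.',
  '#### A deeper heading',
  '### Configure Git',
].join('\n');

console.log(extractMetaSketch(sampleLesson));
```

A line-based scan like this would also pick up `###` lines inside fenced code blocks, so a production version might need to track fence state.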

Step 5: Assembling the Final JavaScript Payload

Once parsing and enrichment are complete, the script constructs three constants:

const PHASES   = …   // full list of phases & lessons
const GLOSSARY = …   // parsed glossary terms
const ARTIFACTS = …  // discovered outputs

As implemented at lines 49-58 of site/build.js, the script stringifies these objects with indentation and writes them to site/data.js. The file is a valid JavaScript module that exports PHASES, GLOSSARY, and ARTIFACTS for the front-end to import.
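
As a rough illustration of this final step, the sketch below renders three objects into JavaScript source text using `JSON.stringify` with two-space indentation. The function name `renderDataModuleSketch`, the header comment, and the guarded CommonJS export line are assumptions; the article does not show how the real file exposes its constants.

```javascript
// Render the three data structures as the text of a JavaScript file.
function renderDataModuleSketch({ phases, glossary, artifacts }) {
  return [
    '// Auto-generated file. Edit the markdown sources instead.',
    `const PHASES = ${JSON.stringify(phases, null, 2)};`,
    `const GLOSSARY = ${JSON.stringify(glossary, null, 2)};`,
    `const ARTIFACTS = ${JSON.stringify(artifacts, null, 2)};`,
    // Hypothetical export guard so the same file can load in a browser or in Node.
    "if (typeof module !== 'undefined') module.exports = { PHASES, GLOSSARY, ARTIFACTS };",
    '',
  ].join('\n');
}

// Sample data invented for this sketch.
const source = renderDataModuleSketch({
  phases: [{ name: 'Phase 0', lessons: [{ name: 'Dev Environment', status: 'complete' }] }],
  glossary: [{ term: 'token', definition: 'A unit of text.' }],
  artifacts: [],
});

console.log(source);
```

Writing the result to disk would then be a single `fs.writeFileSync(outputPath, source)` call, where `outputPath` points at site/data.js.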

The build also performs secondary tasks—such as updating README badges, site statistics, a sitemap, and an llms.txt file—but the core data consumed by the website is the auto-generated PHASES, GLOSSARY, and ARTIFACTS payload.

How to Run the Generator Locally

You can trigger the pipeline manually from the repository root.

node site/build.js

As it runs, the script logs each stage to the console:


📖 Reading source files...
🔍 Parsing ROADMAP.md...
🔍 Parsing README.md...
🔍 Parsing glossary/terms.md...
🔍 Discovering outputs + Phase 14 missions...
📚 Extracting lesson summaries + keywords from docs/en.md...
✅ Generated site/data.js

To inspect the output in a Node.js REPL or another script:

const { PHASES, GLOSSARY, ARTIFACTS } = require('./site/data.js');

console.log(PHASES.length);          // number of phases
console.log(PHASES[0].lessons[0]);   // first lesson object
console.log(GLOSSARY[0]);            // first glossary entry

Summary

The site/build.js script is the sole build entry point that generates data.js from README.md and ROADMAP.md in the rohitg00/ai-engineering-from-scratch project.
parseRoadmap (lines 30-61) extracts completion status from ROADMAP.md into a nested map.
parseReadme (lines 63-130) discovers phases and lessons from README.md and attaches the roadmap status to each entry.
extractLessonMeta enriches lessons with summaries and keywords by reading individual docs/en.md files.
The code at lines 49-58 writes the final PHASES, GLOSSARY, and ARTIFACTS constants to site/data.js, producing a front-end-ready JavaScript module on every push.

Frequently Asked Questions

What triggers the generation of `site/data.js`?

The generation runs automatically on every push via GitHub Actions, but you can also invoke it locally by running node site/build.js from the repository root.

Which files does `site/build.js` read besides `README.md` and `ROADMAP.md`?

In addition to the two primary curriculum files, the script reads glossary/terms.md for domain definitions and scans each lesson's outputs/ folder to discover reusable artifacts.

How does a lesson receive its completion status in `data.js`?

The parseReadme function cross-references the nested status map built by parseRoadmap. Each lesson is matched by name; if no match is found in ROADMAP.md, the status defaults to "planned".

Can I use the generated `data.js` outside of the website?

Yes. Because site/data.js is a standard JavaScript module that exports PHASES, GLOSSARY, and ARTIFACTS, you can require or import it into any Node.js script or compatible bundler for custom reporting or integrations.
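
As one example of custom reporting, the following sketch counts lessons by status. It assumes only the shape described in this article (each phase has a lessons array and each lesson has a status field); the helper name `countLessonsByStatus` and the inline sample data are hypothetical, standing in for the real PHASES array.

```javascript
// Tally lessons per status across all phases.
function countLessonsByStatus(phases) {
  const counts = {};
  for (const phase of phases) {
    for (const lesson of phase.lessons || []) {
      counts[lesson.status] = (counts[lesson.status] || 0) + 1;
    }
  }
  return counts;
}

// Stand-in for: const { PHASES } = require('./site/data.js');
const samplePhases = [
  { name: 'Phase 0', lessons: [{ name: 'Dev Environment', status: 'complete' }] },
  {
    name: 'Phase 1',
    lessons: [
      { name: 'Linear Algebra', status: 'complete' },
      { name: 'Probability', status: 'planned' },
    ],
  },
];

console.log(countLessonsByStatus(samplePhases));
```

In a real script, `samplePhases` would be replaced by the PHASES value loaded from the generated file.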

