How `data.js` Is Generated from `README.md` and `ROADMAP.md` in AI Engineering from Scratch
The site/build.js script generates site/data.js by parsing README.md and ROADMAP.md into structured JavaScript objects, cross-referencing completion status, and enriching each lesson with metadata pulled from its docs/en.md file.
In the rohitg00/ai-engineering-from-scratch repository, the curriculum website is powered by a fully automated build pipeline. Rather than maintaining JSON or JavaScript data files by hand, the project generates data.js from README.md and ROADMAP.md every time the source changes. This keeps the human-readable markdown files as the single source of truth while producing a structured payload that the front-end consumes directly.
The Role of site/build.js
The entire pipeline lives in site/build.js as implemented in rohitg00/ai-engineering-from-scratch. This file is the single build entry point that turns the human-written curriculum files into the JavaScript data model consumed by the website; the markdown files themselves remain the source of truth. The process runs automatically on every push via GitHub Actions and can be invoked locally with node site/build.js.
Step 1: Reading the Source Files
At the start of the build, the script loads the three core markdown files into memory.
The relevant lines read:
// site/build.js, lines 4-6
const readme = fs.readFileSync(README_PATH, 'utf8');
const roadmap = fs.readFileSync(ROADMAP_PATH, 'utf8');
const glossary = fs.readFileSync(GLOSSARY_PATH, 'utf8');
The script reads README.md for the public overview and lesson tables, ROADMAP.md for the phase and lesson status matrix, and glossary/terms.md for the glossary definitions.
Step 2: Parsing ROADMAP.md with parseRoadmap
To determine completion status, the build script calls parseRoadmap, which scans each line of ROADMAP.md for phase headers and individual lesson rows.
As implemented in the file at lines 30-61, the function identifies patterns such as ## Phase 0 … — ✅ for phases and | 01 | Dev Environment | ✅ | for lessons. It then builds a nested map with the structure:
{ Phase → { phaseStatus, lessons: { lessonName → status } } }
This status map becomes the authority for whether a lesson is marked complete, in-progress, or planned.
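The scanning logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from site/build.js: the function name parseRoadmapSketch, the helper toStatus, the regular expressions, and the 🚧 marker for in-progress work are all assumptions made for the example.

```javascript
// Hypothetical sketch of a ROADMAP.md status parser (not the real parseRoadmap).
// Only the ✅ marker is confirmed by the article; 🚧 is an assumed in-progress marker.
function toStatus(marker) {
  if (marker.includes('✅')) return 'complete';
  if (marker.includes('🚧')) return 'in-progress';
  return 'planned';
}

function parseRoadmapSketch(markdown) {
  const phases = {};
  let current = null;

  for (const line of markdown.split('\n')) {
    // Phase heading: "## Phase N", optionally followed by a title and a status emoji.
    const phaseMatch = line.match(/^##\s+(Phase\s+\d+)\b(.*)$/);
    if (phaseMatch) {
      current = { phaseStatus: toStatus(phaseMatch[2]), lessons: {} };
      phases[phaseMatch[1]] = current;
      continue;
    }

    // Lesson row: "| 01 | Lesson name | status |". Header and separator rows
    // are skipped because their first cell is not numeric.
    const rowMatch = line.match(/^\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|/);
    if (rowMatch && current) {
      current.lessons[rowMatch[2]] = toStatus(rowMatch[3]);
    }
  }
  return phases;
}
```

Keying lessons by name rather than by number matches the lookup the article describes, where the README parser later matches each lesson by name.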
Step 3: Parsing README.md with parseReadme
Next, the function parseReadme walks README.md line-by-line to discover phases and lessons. As implemented at lines 63-130, it locates phase headings—supporting both the legacy table-based format and the newer <details> blocks—and then parses the lesson tables that follow them.
For every lesson row, parseReadme extracts:
- The lesson name and optional link.
- The lesson type (e.g., "Build", "Learn").
- The language list via emoji-to-language conversion.
- The GitHub URL to the lesson source, if a link is present.
Crucially, it cross-references the status map produced by parseRoadmap to attach a status field to each lesson. If no match is found, the status falls back to "planned".
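To make the row handling concrete, here is a rough sketch of parsing one lesson row and attaching a status. It is an illustration under stated assumptions rather than the actual parseReadme: the four-column row layout, the function name parseLessonRowSketch, and the LANGUAGE_BY_EMOJI table are hypothetical, and the link target is returned as written instead of being turned into a GitHub URL.

```javascript
// Hypothetical emoji-to-language table; the real mapping in site/build.js may differ.
const LANGUAGE_BY_EMOJI = { '🐍': 'Python', '🦀': 'Rust' };

// Assumed row layout: | number | name or [name](url) | type | language emoji |
function parseLessonRowSketch(line, phaseStatuses = { lessons: {} }) {
  const cells = line.split('|').slice(1, -1).map((cell) => cell.trim());
  if (cells.length < 4 || !/^\d+$/.test(cells[0])) return null; // not a lesson row

  // The name cell is either plain text or a markdown link.
  const link = cells[1].match(/^\[([^\]]+)\]\(([^)]+)\)$/);
  const name = link ? link[1] : cells[1];
  const url = link ? link[2] : null;

  // Convert every known emoji found in the language cell.
  const languages = Object.keys(LANGUAGE_BY_EMOJI)
    .filter((emoji) => cells[3].includes(emoji))
    .map((emoji) => LANGUAGE_BY_EMOJI[emoji]);

  // Cross-reference the roadmap status map; fall back to "planned" when absent.
  const lessons = phaseStatuses.lessons || {};
  const status = Object.prototype.hasOwnProperty.call(lessons, name)
    ? lessons[name]
    : 'planned';

  return { name, url, type: cells[2], languages, status };
}
```

The fallback means a lesson that appears in README.md but not in ROADMAP.md still gets a well-defined status instead of undefined.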
Step 4: Enriching Lessons with extractLessonMeta
After the curriculum structure is known, the build loop at lines 22-30 of the build function calls extractLessonMeta for every lesson that has a source URL. This helper opens the lesson's docs/en.md file and pulls two pieces of metadata:
- The first blockquote, which is used as a one-sentence summary.
- All H3 headings, which are concatenated into a keyword string.
These values are added to the lesson object as summary and keywords, making the generated data.js searchable without manual indexing.
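A simplified version of this extraction might look like the sketch below. It is not the real extractLessonMeta: the function name extractLessonMetaSketch is hypothetical, it takes the markdown text directly instead of opening a file, the keyword separator (a single space) is an assumption, and it does not skip headings or quotes that appear inside fenced code blocks.

```javascript
// Hypothetical sketch: derive a summary and keywords from a lesson's markdown text.
function extractLessonMetaSketch(markdown) {
  const lines = markdown.split('\n');

  // Summary: the first run of consecutive lines that start with ">".
  const start = lines.findIndex((line) => line.trimStart().startsWith('>'));
  const quote = [];
  for (let i = start; i >= 0 && i < lines.length; i++) {
    const trimmed = lines[i].trimStart();
    if (!trimmed.startsWith('>')) break;
    quote.push(trimmed.replace(/^>\s?/, ''));
  }
  const summary = quote.join(' ').trim();

  // Keywords: the text of every H3 heading ("### ..."), joined into one string.
  const keywords = lines
    .filter((line) => /^###\s+/.test(line))
    .map((line) => line.replace(/^###\s+/, '').trim())
    .join(' ');

  return { summary, keywords };
}
```

A caller could read the file with fs.readFileSync and merge the result into the lesson object, for example with Object.assign(lesson, extractLessonMetaSketch(text)).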
Step 5: Assembling the Final JavaScript Payload
Once parsing and enrichment are complete, the script constructs three constants:
const PHASES = … // full list of phases & lessons
const GLOSSARY = … // parsed glossary terms
const ARTIFACTS = … // discovered outputs
As implemented at lines 49-58 of site/build.js, the script stringifies these objects with indentation and writes them to site/data.js. The file is a valid JavaScript module that exports PHASES, GLOSSARY, and ARTIFACTS for the front-end to import.
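The stringify-and-write step could be sketched as below. This is an assumption-laden illustration, not the code at those lines: the function name renderDataModuleSketch, the header comment, and the CommonJS export line are hypothetical, chosen so the output would work with require as in the inspection example later in this article.

```javascript
// Hypothetical sketch: render the text of a data.js-style module from three objects.
function renderDataModuleSketch({ phases, glossary, artifacts }) {
  // JSON.stringify with an indent of 2 keeps the generated file diff-friendly.
  const toConst = (name, value) => `const ${name} = ${JSON.stringify(value, null, 2)};`;

  return [
    '// Auto-generated file. Do not edit by hand.',
    toConst('PHASES', phases),
    toConst('GLOSSARY', glossary),
    toConst('ARTIFACTS', artifacts),
    // Assumed CommonJS export; the real file may export its constants differently.
    'module.exports = { PHASES, GLOSSARY, ARTIFACTS };',
    '',
  ].join('\n');
}
```

Writing the result is then a single call such as fs.writeFileSync('site/data.js', renderDataModuleSketch(data)). Keeping rendering separate from file I/O makes the output easy to check without touching the disk.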
The build also performs secondary tasks—such as updating README badges, site statistics, a sitemap, and an llms.txt file—but the core data consumed by the website is the auto-generated PHASES, GLOSSARY, and ARTIFACTS payload.
How to Run the Generator Locally
You can trigger the pipeline manually from the repository root.
node site/build.js
After execution, the console logs each stage:
📖 Reading source files...
🔍 Parsing ROADMAP.md...
🔍 Parsing README.md...
🔍 Parsing glossary/terms.md...
🔍 Discovering outputs + Phase 14 missions...
📚 Extracting lesson summaries + keywords from docs/en.md...
✅ Generated site/data.js
To inspect the output in a Node.js REPL or another script:
const { PHASES, GLOSSARY, ARTIFACTS } = require('./site/data.js');
console.log(PHASES.length); // number of phases
console.log(PHASES[0].lessons[0]); // first lesson object
console.log(GLOSSARY[0]); // first glossary entry
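Building on the inspection snippet above, a small report over the payload might look like this. It is a sketch that assumes each phase object carries a lessons array and each lesson a status field, as described in this article; the helper name countLessonsByStatus is hypothetical.

```javascript
// Hypothetical helper: tally lessons by status across all phases.
function countLessonsByStatus(phases) {
  const counts = {};
  for (const phase of phases) {
    for (const lesson of phase.lessons || []) {
      // Mirror the build's fallback: a missing status is treated as "planned".
      const status = lesson.status || 'planned';
      counts[status] = (counts[status] || 0) + 1;
    }
  }
  return counts;
}
```

With the generated file in place, this could be called as countLessonsByStatus(require('./site/data.js').PHASES).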
Summary
- The site/build.js script is the sole build entry point that generates data.js from README.md and ROADMAP.md in the rohitg00/ai-engineering-from-scratch project.
- parseRoadmap (lines 30-61) extracts completion status from ROADMAP.md into a nested map.
- parseReadme (lines 63-130) discovers phases and lessons from README.md and attaches the roadmap status to each entry.
- extractLessonMeta enriches lessons with summaries and keywords by reading individual docs/en.md files.
- The final PHASES, GLOSSARY, and ARTIFACTS constants are written to site/data.js at lines 49-58, producing a front-end-ready JavaScript module on every push.
Frequently Asked Questions
What triggers the generation of site/data.js?
The generation runs automatically on every push via GitHub Actions, but you can also invoke it locally by running node site/build.js from the repository root.
Which files does site/build.js read besides README.md and ROADMAP.md?
In addition to the two primary curriculum files, the script reads glossary/terms.md for domain definitions and scans each lesson's outputs/ folder to discover reusable artifacts.
How does a lesson receive its completion status in data.js?
The parseReadme function cross-references the nested status map built by parseRoadmap. Each lesson is matched by name; if no match is found in ROADMAP.md, the status defaults to "planned".
Can I use the generated data.js outside of the website?
Yes. Because site/data.js is a standard JavaScript module that exports PHASES, GLOSSARY, and ARTIFACTS, you can require or import it into any Node.js script or compatible bundler for custom reporting or integrations.