How AI Engineering from Scratch Audits Lesson Structures: Automated Curriculum Validation

The rohitg00/ai-engineering-from-scratch repository maintains curriculum integrity through an automated lesson-audit pipeline in scripts/audit_lessons.py. The script validates every lesson against 10+ strict rules covering directory naming, documentation standards, code presence, and quiz schema, and a GitHub Actions workflow blocks merges until all violations are resolved.

The rohitg00/ai-engineering-from-scratch project maintains a catalog of 435 lessons, which makes consistent structural conventions hard to enforce by hand. To prevent drift, the repository implements a self-checking curriculum engine that audits lesson directories against invariant rules on every pull request. The checks range from file naming patterns to JSON schema compliance, and structural violations must be fixed before a change reaches the main branch.

The Core Audit Pipeline

The validation logic resides in scripts/audit_lessons.py, which implements discrete check functions that inspect every lesson directory under phases/. When a rule violation is detected, the script instantiates an Issue object (defined at lines 38‑44) capturing the rule code, lesson path, file path, and a human-readable message. The script aggregates all issues and exits with status 1 if any exist, causing dependent CI jobs to fail.
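The aggregation pattern described above can be sketched as follows. This is a minimal illustration, not the repository's code: the field names, the render() method, and the exit_code() helper are assumptions chosen to match the description (rule code, lesson path, file path, message) and the sample output shown later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One rule violation. Field names are illustrative assumptions."""
    rule: str      # rule code, e.g. "L004"
    lesson: str    # lesson directory the issue belongs to
    path: str      # file (or directory) that triggered the rule
    message: str   # human-readable description

    def render(self) -> str:
        # Mirrors the "[RULE] path: message" shape of the sample report line.
        return f"[{self.rule}] {self.path}: {self.message}"


def exit_code(issues: list[Issue]) -> int:
    """Return 1 when any issue was collected, otherwise 0."""
    return 1 if issues else 0


def report(issues: list[Issue]) -> int:
    """Print every issue and return the process exit status."""
    for issue in issues:
        print(issue.render())
    return exit_code(issues)
```

A caller would pass the return value of report() to sys.exit(), so a single violation is enough to fail a CI job that runs the script.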

Directory and Documentation Rules (L001‑L004)

The audit enforces strict organizational conventions through dedicated validation functions:

  • L001 – Directory Naming: check_lesson_dir_pattern() (lines 85‑94) validates that every lesson folder matches the NN-slug pattern (e.g., 01-intro-to-nn). This ensures lexical sorting correlates with curriculum progression.

  • L002‑L004 – Documentation Standards: check_docs_en_md() (lines 97‑116) verifies that each lesson contains a docs/en.md file that is UTF-8 encoded, exceeds 200 bytes in size, and begins with a top-level # heading. These constraints guarantee that lessons contain substantive, properly formatted English documentation.
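As a rough sketch of what the naming and documentation rules amount to, the checks might look like the code below. The regular expression, the helper names, and the message strings are assumptions; in particular, whether a file of exactly 200 bytes passes is a guess, and the real functions operate on lesson directories rather than raw bytes.

```python
import re

# Assumed reading of "NN-slug": two digits, a hyphen, then a lowercase
# hyphen-separated slug such as "01-intro-to-nn".
LESSON_DIR_RE = re.compile(r"\d{2}-[a-z0-9]+(?:-[a-z0-9]+)*")
MIN_DOC_BYTES = 200  # the docs must exceed this size (boundary is assumed)


def dir_name_ok(name: str) -> bool:
    """Directory naming check: the whole name must match NN-slug."""
    return LESSON_DIR_RE.fullmatch(name) is not None


def docs_problems(raw: bytes) -> list[str]:
    """Documentation checks applied to the raw bytes of docs/en.md."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Without valid text the remaining checks cannot run.
        return ["docs/en.md is not valid UTF-8"]
    problems = []
    if len(raw) <= MIN_DOC_BYTES:
        problems.append("docs/en.md is too short")
    # Treat the first non-blank line as the start of the document.
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    if not first.startswith("# "):
        problems.append("docs/en.md missing top-level H1")
    return problems
```

For example, dir_name_ok("01-intro-to-nn") is true while dir_name_ok("intro") is false, and a long document that opens with plain text instead of a # heading yields only the missing-H1 message.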

Code and Quiz Validation (L005‑L009)

Beyond documentation, the audit verifies functional content and assessment integrity:

  • L005 – Code Presence: check_code_main() (lines 119‑127) ensures the code/ subdirectory contains at least one non-ignored source file, preventing empty lesson shells.

  • L006‑L009 – Quiz Schema: check_quiz() (lines 129‑194) performs deep validation on quiz.json. It verifies valid JSON syntax and enforces the canonical schema requiring fields for stage, question, options, correct, and explanation. The function also validates that the options array contains between 2 and 6 items (enforced via MIN_OPTIONS = 2 and MAX_OPTIONS = 6 at lines 76‑84), and that the correct index falls within range.

  • L007 – Legacy Format Detection: Within check_quiz() (lines 57‑66), the audit detects obsolete key schemas like q/choices/answer and surfaces warnings, ensuring the curriculum migrates uniformly to the modern format.
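The quiz rules above can be illustrated with a small validator for a single question entry. This is a sketch under stated assumptions: the article does not show the overall shape of quiz.json, so the code validates one question as a dict, treats correct as a zero-based integer index, and uses hypothetical helper names and messages. Only the field names and the MIN_OPTIONS / MAX_OPTIONS values come from the description.

```python
import json

REQUIRED_KEYS = ("stage", "question", "options", "correct", "explanation")
LEGACY_KEYS = ("q", "choices", "answer")  # obsolete schema named in the article
MIN_OPTIONS = 2
MAX_OPTIONS = 6


def parse_quiz(text: str):
    """Return (data, None) on success or (None, error message) on bad JSON."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"


def question_problems(item: dict) -> list[str]:
    """Check one question entry against the canonical schema."""
    problems = []
    legacy = [k for k in LEGACY_KEYS if k in item]
    if legacy:
        problems.append("legacy keys present: " + ", ".join(legacy))
    missing = [k for k in REQUIRED_KEYS if k not in item]
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    options = item.get("options")
    if isinstance(options, list):
        if not MIN_OPTIONS <= len(options) <= MAX_OPTIONS:
            problems.append(f"options must have {MIN_OPTIONS}-{MAX_OPTIONS} items")
        if "correct" in item:
            correct = item["correct"]
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_index = isinstance(correct, int) and not isinstance(correct, bool)
            if not (is_index and 0 <= correct < len(options)):
                problems.append("correct index out of range")
    return problems
```

Keeping JSON parsing separate from schema checks means a syntax error produces one clear message instead of a cascade of missing-key reports.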

The pipeline validates internal documentation links through check_internal_links() (lines 96‑112). This rule resolves all relative paths in docs/en.md against the repository filesystem, ensuring that cross-references between lessons remain valid as the curriculum evolves.
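A link check of this kind can be sketched as below. The regular expression, the list of skipped prefixes, and the choice to resolve targets relative to the directory containing the Markdown file are all assumptions made for illustration; the real check_internal_links() may parse links and resolve paths differently.

```python
import re
from pathlib import Path

# Matches inline Markdown links and images: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Targets that are not filesystem paths and are therefore skipped.
SKIPPED_PREFIXES = ("http://", "https://", "mailto:", "#")


def relative_targets(markdown: str) -> list[str]:
    """Extract relative link targets, dropping any #fragment suffix."""
    targets = []
    for target in LINK_RE.findall(markdown):
        if target.startswith(SKIPPED_PREFIXES):
            continue
        targets.append(target.split("#", 1)[0])
    return targets


def broken_links(doc_path: Path) -> list[str]:
    """Return relative targets in doc_path that do not exist on disk."""
    text = doc_path.read_text(encoding="utf-8")
    base = doc_path.parent  # assumption: links resolve from the file's folder
    return [t for t in relative_targets(text) if not (base / t).exists()]
```

Calling broken_links(Path("phases/01-foundations/01-math-prereqs/docs/en.md")) would list every relative target in that file that no longer resolves, which is the situation this rule is meant to catch when lessons are renamed or moved.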

Continuous Integration Enforcement

The .github/workflows/curriculum.yml workflow automates the audit on every push to main and every pull request affecting curriculum files. The workflow executes three sequential steps: repository checkout, Python 3.12 setup, and execution of python3 scripts/audit_lessons.py. If the script returns a non-zero exit code due to rule violations, the workflow fails and blocks the PR from merging until developers resolve the underlying issues.

Automated README Synchronization

To prevent documentation drift, the repository includes a complementary validator: scripts/check_readme_counts.py. This script compares hard-coded lesson, phase, and skill counts in README.md against authoritative totals stored in catalog.json.

The curriculum.yml workflow defines a separate readme-counts-sync job that executes only on pushes to main. This job runs the script with the --fix flag, which rewrites README.md in-place with corrected statistics, then automatically commits the changes. This self-healing mechanism ensures the public-facing curriculum summary accurately reflects the actual 435-lesson catalog without requiring manual updates.
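The core of a count-synchronization step like this can be sketched in a few lines. Everything specific here is hypothetical: the article does not show how README.md phrases its counts or how catalog.json is structured, so the sketch assumes the README contains phrases such as "435 lessons" and that the authoritative totals have already been loaded into a dict keyed by noun.

```python
import re


def sync_counts(readme: str, totals: dict[str, int]) -> tuple[str, bool]:
    """Rewrite '<number> <noun>' phrases to match the given totals.

    Returns the updated text and a flag saying whether anything changed.
    """
    updated = readme
    for noun, total in totals.items():
        # Match a number only when it is directly followed by the noun.
        pattern = re.compile(rf"\b\d+(?= {re.escape(noun)}\b)")
        updated = pattern.sub(str(total), updated)
    return updated, updated != readme
```

A check-only mode would fail when the returned flag is true, while a --fix style mode would write the updated text back to README.md. Returning the flag separately keeps both modes on the same code path.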

Local Execution and Debugging

Developers can run the audit locally to catch violations before submitting pull requests:


# Generate human-readable report
python3 scripts/audit_lessons.py

# Output JSON for integration with other tooling
python3 scripts/audit_lessons.py --json

When violations occur, the script emits structured messages like the following:


[L004] phases/01-foundations/01-math-prereqs/docs/en.md: docs/en.md missing top-level H1
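When a report contains many such lines, a small helper can group them by rule code. The sketch below parses only the single-line shape shown above; the helper names are hypothetical, and the real report may include other lines (headers, totals) that this pattern simply ignores.

```python
import re
from collections import Counter

# "[L004] some/path: message" -> (rule, path, message)
ISSUE_LINE_RE = re.compile(r"^\[(L\d{3})\] (.+?): (.*)$")


def parse_issue_line(line: str):
    """Return (rule, path, message), or None if the line is not an issue."""
    match = ISSUE_LINE_RE.match(line.strip())
    return match.groups() if match else None


def count_by_rule(report: str) -> Counter:
    """Count how many issues each rule code produced in a report."""
    parsed = (parse_issue_line(line) for line in report.splitlines())
    return Counter(item[0] for item in parsed if item)
```

Piping the audit output into count_by_rule() gives a quick view of which rules account for most violations before fixing them one by one.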

To synchronize README counts manually:

python3 scripts/check_readme_counts.py --fix
git diff README.md  # Review changes before committing

Summary

  • Automated Validation: The scripts/audit_lessons.py pipeline enforces 10+ invariant rules (L001‑L010) covering directory naming (NN-slug), documentation requirements (UTF-8, ≥200 bytes, H1 heading), code presence, and strict quiz JSON schemas.
  • CI Blocking: The .github/workflows/curriculum.yml workflow executes the audit on every PR, exiting with status 1 when violations are detected to prevent broken lessons from merging.
  • Schema Rigor: Quiz validation includes bounds checking (2‑6 options) and legacy format detection, while internal link resolution keeps cross-references between lessons valid across the curriculum.
  • Self-Healing Docs: The check_readme_counts.py utility with --fix automatically synchronizes README.md statistics against catalog.json on every push to main.

Frequently Asked Questions

What triggers the lesson audit in AI Engineering from Scratch?

The audit triggers automatically via the curriculum GitHub Actions workflow defined in .github/workflows/curriculum.yml. It runs on every push to the main branch and every pull request that modifies curriculum files, executing python3 scripts/audit_lessons.py to scan all lesson directories under phases/. If the script detects any rule violations, the workflow fails and prevents the PR from merging.

How does the audit handle quiz schema violations?

The check_quiz() function validates that every quiz.json file contains valid JSON adhering to the canonical schema with fields for stage, question, options, correct, and explanation. It enforces L008 by verifying that the options array contains between 2 and 6 items using the MIN_OPTIONS and MAX_OPTIONS constants, and ensures the correct index points to a valid option. Legacy key formats such as q/choices/answer trigger L007 warnings to prompt migration to the current schema.

Can I run the lesson audit locally before submitting a PR?

Yes. Execute python3 scripts/audit_lessons.py from the repository root to receive a human-readable report, or append --json for machine-parseable output. This is the same script the CI workflow runs, so the Issue object aggregation (lines 38‑44) and exit code behavior are the same, although your local Python version may differ from the workflow's Python 3.12. Running it first lets you fix structural errors, such as invalid directory names or missing docs/en.md files, before the workflow blocks your pull request.

Why does the README.md update automatically after pushes to main?

The readme-counts-sync job in the curriculum workflow runs only on the main branch, executing python3 scripts/check_readme_counts.py --fix. This command compares the hard-coded lesson counts in README.md against the authoritative catalog.json and rewrites the markdown file in-place with accurate statistics. The workflow then commits these changes automatically, ensuring the repository documentation remains synchronized with the actual curriculum structure without manual intervention.
