Invariant Validation Checks and Rules in audit_lessons.py: Complete Guide to the Curriculum Auditor

The audit_lessons.py script enforces ten structural invariant rules (L001–L010) that validate lesson directory naming, documentation completeness, code presence, and quiz schema integrity across the AI Engineering from Scratch curriculum.

The audit_lessons.py file in the rohitg00/ai-engineering-from-scratch repository serves as the gatekeeper for curriculum quality, walking every lesson directory under phases/ to ensure consistent structure. These invariant validation checks guarantee that each lesson meets strict standards for documentation, runnable code, and assessment data before merging. Understanding these rules helps contributors fix validation errors and maintain the repository's pedagogical integrity.

How the Audit System Works

The validation engine centers on the audit_lesson() function (lines 14–22), which orchestrates the inspection sequence. For every lesson discovered, the auditor executes checks in the following order: L001 → L004 → L005 → L006 → L008–L009 → L010. Each failed check registers an issue tagged with a canonical rule code, which allows precise error tracking and automated CI integration.
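The "run checks in order, register an issue per failure" pattern described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the `Issue` class, the `Check` type, and `run_checks` are hypothetical names, and only the `rule`, `file`, and `message` field names are taken from the JSON example later in this article.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass(frozen=True)
class Issue:
    rule: str     # canonical rule code, e.g. "L003"
    file: str     # path the issue refers to
    message: str  # human-readable description


# Hypothetical check signature: take a lesson directory, return the issues found.
Check = Callable[[Path], List[Issue]]


def run_checks(lesson_dir: Path, checks: List[Check]) -> List[Issue]:
    """Run each check in the given order and collect every reported issue."""
    issues: List[Issue] = []
    for check in checks:
        issues.extend(check(lesson_dir))
    return issues


def demo_check(lesson_dir: Path) -> List[Issue]:
    # Placeholder check used only to show the shape of a rule function.
    return [Issue("L999", str(lesson_dir), "demo issue")]


for issue in run_checks(Path("phases/01-demo/01-lesson"), [demo_check]):
    print(f"[{issue.rule}] {issue.file}: {issue.message}")
```

Keeping every check behind one signature means the order of execution is just the order of a list, which makes a fixed sequence like the one above easy to express.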

The 10 Invariant Validation Rules (L001–L010)

Naming and Directory Structure (L001–L002)

L001 – Lesson Directory Naming: Directory names must match the pattern NN-slug, that is, a two-digit numeric prefix, a hyphen, and a slug of lowercase alphanumerics or hyphens. This convention keeps URL slugs and sort order consistent across the curriculum. The regex validation appears in [scripts/audit_lessons.py lines 85–93](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py#L85-L93).
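As an illustration of the NN-slug rule, here is one regular expression that fits the description. The exact pattern used by the script is not quoted in this article, so `LESSON_DIR_RE` and `is_valid_lesson_dir` are assumptions made for this sketch.

```python
import re

# Assumed pattern: two ASCII digits, a hyphen, then one or more
# lowercase letters, digits, or hyphens.
LESSON_DIR_RE = re.compile(r"[0-9]{2}-[a-z0-9-]+")


def is_valid_lesson_dir(name: str) -> bool:
    """Return True when the whole directory name matches NN-slug."""
    return LESSON_DIR_RE.fullmatch(name) is not None


for name in ["04-conditional-gans-pix2pix", "4-gans", "04_gans", "04-GANs"]:
    print(f"{name}: {is_valid_lesson_dir(name)}")
```

Using `fullmatch` rather than `search` avoids accepting names that merely contain a valid fragment.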

L002 – Presence of docs/en.md: Every lesson must contain a markdown file at docs/en.md. This file serves as the primary English-language learning material. The existence check runs at lines 99–101.

Documentation Quality Standards (L003–L004, L010)

L003 – Minimum Documentation Size: The docs/en.md file must contain at least 200 bytes of content. Files falling below this threshold are flagged as incomplete stubs. This length check executes at lines 107–113.

L004 – Top-Level Heading: The markdown must contain at least one H1 heading (# …) to ensure the lesson displays a visible title in rendered documentation. The regex scan appears at lines 114–116.
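The three docs/en.md checks (L002 to L004) can be sketched together. This is a rough approximation under stated assumptions: `check_docs`, `MIN_DOC_BYTES`, and `H1_RE` are hypothetical names, the H1 regex is a guess at what "at least one H1 heading" means, and only the 200-byte threshold and the L003 message wording come from this article.

```python
import re
from pathlib import Path
from typing import List, Tuple

MIN_DOC_BYTES = 200  # threshold stated for L003
# Assumed H1 form: a line starting with "# " followed by text.
# A simple scan like this would also match "# ..." lines inside code blocks.
H1_RE = re.compile(r"^# \S", re.MULTILINE)


def check_docs(lesson_dir: Path) -> List[Tuple[str, str]]:
    """Return (rule, message) pairs for the docs checks L002-L004."""
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        # Without the file, the size and heading checks have nothing to inspect.
        return [("L002", "docs/en.md is missing")]
    issues: List[Tuple[str, str]] = []
    data = doc.read_bytes()
    if len(data) < MIN_DOC_BYTES:
        issues.append(
            ("L003", f"docs/en.md shorter than {MIN_DOC_BYTES} bytes (got {len(data)})")
        )
    if not H1_RE.search(data.decode("utf-8", errors="replace")):
        issues.append(("L004", "docs/en.md has no H1 heading"))
    return issues
```

Measuring `len(data)` on the raw bytes, rather than on decoded text, matches the rule's wording of a byte threshold.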

L010 – Internal Markdown Links: All relative links inside docs/en.md must resolve to existing files or directories within the repository. External URLs, mailto links, and data URIs are ignored. This prevents broken cross-references between lessons. The link resolution logic spans lines 996–1012.
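A link-resolution check in the spirit of L010 might look like the sketch below. It is an approximation, not the script's logic: `LINK_RE`, `SKIP_PREFIXES`, and `unresolved_links` are hypothetical, the regex only handles inline `[text](target)` links, skipping pure `#anchor` links is an added assumption, and the sketch does not verify that a target stays inside the repository.

```python
import re
from pathlib import Path
from typing import List

# Inline markdown links: [text](target) or [text](target "title").
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# External URLs, mailto links, and data URIs are ignored per the rule;
# skipping in-page "#anchor" links is an assumption of this sketch.
SKIP_PREFIXES = ("http://", "https://", "mailto:", "data:", "#")


def unresolved_links(doc_path: Path) -> List[str]:
    """Return relative link targets in doc_path that do not exist on disk."""
    text = doc_path.read_text(encoding="utf-8")
    missing: List[str] = []
    for target in LINK_RE.findall(text):
        if target.startswith(SKIP_PREFIXES):
            continue
        rel = target.split("#", 1)[0]  # drop any trailing in-page anchor
        if rel and not (doc_path.parent / rel).exists():
            missing.append(target)
    return missing
```

Targets are resolved relative to the directory containing the markdown file, which is how relative links behave when the file is rendered.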

Code Presence Validation (L005)

L005 – Non-Empty Code Directory: If a lesson includes a code/ folder, it must contain at least one source or configuration file. A small ignore list of system entries (such as .DS_Store or __pycache__) does not count toward this requirement. This ensures that lessons advertising code examples actually provide runnable material. The emptiness check runs at lines 119–127.
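The emptiness test can be approximated as below. `IGNORED_NAMES` contains only the two entries named in the text (the real list may be longer), and both `code_dir_is_empty` and the choice to search recursively are assumptions of this sketch.

```python
from pathlib import Path

# Only the two entries mentioned in the article; the script's list may differ.
IGNORED_NAMES = {".DS_Store", "__pycache__"}


def code_dir_is_empty(lesson_dir: Path) -> bool:
    """True when code/ exists but holds nothing besides ignored entries."""
    code_dir = lesson_dir / "code"
    if not code_dir.is_dir():
        return False  # the rule applies only when code/ is present
    for path in code_dir.rglob("*"):
        parts = set(path.relative_to(code_dir).parts)
        # Count a file only if neither it nor any parent folder is ignored.
        if path.is_file() and not (parts & IGNORED_NAMES):
            return False
    return True
```

A lesson with no code/ folder returns False here because L005 is conditional: it flags empty code directories, not missing ones.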

Quiz Schema Integrity (L006–L009)

L006 – Quiz JSON Schema: The quiz.json file (if present) must be valid JSON containing a non-empty questions[] array. Each question must include all canonical keys: stage, question, options, correct, and explanation. Missing keys, empty arrays, or malformed JSON trigger this error. See the validation at lines 133–152.
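A schema check along these lines could be written as follows. The five required keys and the non-empty questions[] requirement come from the rule description; the function name `schema_errors` and the message wording are hypothetical.

```python
import json
from typing import List

REQUIRED_KEYS = ("stage", "question", "options", "correct", "explanation")


def schema_errors(raw: str) -> List[str]:
    """Return L006-style problems found in the text of a quiz.json file."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list) or not questions:
        return ["questions[] must be a non-empty array"]
    errors: List[str] = []
    for i, question in enumerate(questions):
        if not isinstance(question, dict):
            errors.append(f"question[{i}] must be an object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in question]
        if missing:
            errors.append(f"question[{i}] missing keys: {', '.join(missing)}")
    return errors


sample = '{"questions": [{"stage": "pre", "question": "2 + 2?", "options": ["3", "4"]}]}'
print(schema_errors(sample))
```

The sample question lacks `correct` and `explanation`, so the sketch reports both as missing for `question[0]`.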

L007 – Legacy Quiz Schema Detection: This rule detects usage of deprecated schema keys (q, choices, answer) and triggers a warning encouraging migration to the current canonical schema. It helps maintain data consistency across the curriculum. The legacy-key detection appears at lines 157–166.
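Detecting the deprecated keys amounts to a set intersection. The three legacy key names are from the rule description; `legacy_keys_used` is a hypothetical helper, and the sketch deliberately does not guess how legacy keys map onto canonical ones.

```python
from typing import List

LEGACY_KEYS = {"q", "choices", "answer"}


def legacy_keys_used(question: dict) -> List[str]:
    """Return the deprecated keys present in one quiz question, sorted."""
    return sorted(question.keys() & LEGACY_KEYS)


print(legacy_keys_used({"q": "2 + 2?", "choices": ["3", "4"], "answer": 1}))
print(legacy_keys_used({"question": "2 + 2?", "options": ["3", "4"], "correct": 1}))
```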

L008 – Options Length: Each options array within a quiz question must contain 2 to 6 entries. Values outside this range indicate either insufficient choices or excessive complexity. The bounds check executes at lines 176–184.

L009 – Correct Answer Index: The correct field must be an integer index satisfying 0 ≤ correct < len(options), ensuring the correct answer points to a valid option position. This prevents out-of-bound references in quiz rendering. The index validation runs at lines 186–193.
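The two bounds checks (L008 and L009) fit naturally into one helper. The 2..6 range, the index condition, and the L008 message wording are taken from this article; `option_errors`, the L009 message text, and the decision to reject booleans as indices are assumptions of the sketch.

```python
from typing import List, Tuple


def option_errors(index: int, question: dict) -> List[Tuple[str, str]]:
    """Return (rule, message) pairs for the L008 and L009 checks on one question."""
    errors: List[Tuple[str, str]] = []
    options = question.get("options")
    has_list = isinstance(options, list)

    # L008: options must be a list with 2 to 6 entries.
    if not has_list or not 2 <= len(options) <= 6:
        got = len(options) if has_list else "no list"
        errors.append(
            ("L008", f"question[{index}] options length must be 2..6 (got {got})")
        )

    # L009: correct must be an integer with 0 <= correct < len(options).
    # bool is a subclass of int in Python, so it is excluded explicitly here.
    correct = question.get("correct")
    is_index = isinstance(correct, int) and not isinstance(correct, bool)
    if not (is_index and has_list and 0 <= correct < len(options)):
        errors.append(
            ("L009", f"question[{index}] correct must index into options (got {correct!r})")
        )
    return errors
```

The two checks are independent: a one-option question with `correct` set to 0 fails only L008, while a two-option question with `correct` set to 2 fails only L009.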

Running the Audit Locally

Execute the validator from the repository root to scan the entire curriculum:


# Human-readable report
python scripts/audit_lessons.py

# JSON output for CI pipelines
python scripts/audit_lessons.py --json

# Limit audit to a specific phase (e.g., Phase 19)
python scripts/audit_lessons.py --phase 19

Sample output format:


audit_lessons.py — 435 lesson(s) checked, 3 issue(s)

  [L003] phases/08-generative-ai/04-conditional-gans-pix2pix/docs/en.md: docs/en.md shorter than 200 bytes (got 128)
  [L008] phases/10-llms-from-scratch/12-inference-optimization/quiz.json: question[2] options length must be 2..6 (got 1)
  [L010] phases/19-capstone-projects/84-refusal-evaluation/docs/en.md: internal link does not resolve: ./nonexistent.md

Summary by rule:
  L003: 1
  L008: 1
  L010: 1

Programmatically parsing JSON output:

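import json
import subprocess

# Run the auditor and capture its JSON report from stdout.
# check=True is left out on the assumption that the script may exit
# non-zero when it finds issues, which would raise before parsing.
result = subprocess.run(
    ["python", "scripts/audit_lessons.py", "--json"],
    capture_output=True, text=True
)
audit = json.loads(result.stdout)
print(f"Checked {audit['lessons_checked']} lessons, found {len(audit['issues'])} issues")

# Print one line per issue in the same shape as the human-readable report.
for issue in audit["issues"]:
    print(f"[{issue['rule']}] {issue['file']}: {issue['message']}")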
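Once the issues are parsed, a per-rule tally like the "Summary by rule" section of the report can be rebuilt with a counter. This sketch assumes each issue is a dictionary with a `rule` key, as in the parsing example above; `summarize_by_rule` is a hypothetical helper and the sample data mirrors the sample output shown earlier.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_by_rule(issues: Iterable[dict]) -> Dict[str, int]:
    """Count issues per rule code, ordered by rule code."""
    counts = Counter(issue["rule"] for issue in issues)
    return dict(sorted(counts.items()))


sample_issues = [
    {"rule": "L010", "file": "a/docs/en.md", "message": "internal link does not resolve"},
    {"rule": "L003", "file": "b/docs/en.md", "message": "docs/en.md shorter than 200 bytes"},
    {"rule": "L003", "file": "c/docs/en.md", "message": "docs/en.md shorter than 200 bytes"},
]
for rule, count in summarize_by_rule(sample_issues).items():
    print(f"  {rule}: {count}")
```

A CI job could apply the same tally to decide which rules block a merge and which only produce warnings.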

Summary

  • audit_lessons.py implements ten invariant validation rules (L001–L010) that enforce directory naming, documentation standards, code presence, and quiz integrity.
  • L001–L002 validate physical structure and required files (NN-slug naming and docs/en.md existence).
  • L003–L004 ensure documentation quality (minimum 200 bytes, H1 presence).
  • L005 requires non-empty code/ directories when present.
  • L006–L009 enforce strict quiz JSON schema including valid options length (2–6) and correct answer indexing.
  • L010 prevents broken internal links by verifying all relative Markdown references resolve to existing repository paths.
  • The audit_lesson() function at lines 14–22 orchestrates checks in a specific sequence, registering violations with canonical rule codes for actionable CI feedback.

Frequently Asked Questions

What happens if a lesson fails the L001 naming convention check?

The audit reports an L001 violation indicating that the directory name does not match the NN-slug pattern. Contributors must rename the folder so that it starts with a two-digit numeric prefix, followed by a hyphen and a slug of lowercase alphanumeric characters or hyphens, before the CI pipeline will pass.

How does the audit handle missing quiz.json files?

Quiz validation rules (L006–L009) apply only when quiz.json exists. Lessons without quizzes skip these checks entirely, but they must still satisfy documentation rules (L002–L004, L010) and the code presence rule (L005) if applicable.

Can I run the audit on a specific phase only?

Yes. Use the --phase flag followed by the phase number to limit validation to a single phase. For example, python scripts/audit_lessons.py --phase 19 audits only Phase 19, reducing execution time when developing or validating specific curriculum sections.

What is the difference between L006 and L007 quiz validation rules?

L006 enforces the current canonical schema requiring keys stage, question, options, correct, and explanation. L007 specifically detects legacy schema usage (q, choices, answer) and issues a warning rather than an error, serving as a migration reminder for older lesson content.
