How the audit_lessons.py CI Validation Script Enforces Lesson Quality

June 7, 2026 · rohitg00/ai-engineering-from-scratch

The audit_lessons.py CI validation script walks the phases/ tree in rohitg00/ai-engineering-from-scratch, validates every NN-slug lesson directory for documentation, code, quizzes, and internal links, and exits with code 1 if any structural invariant is violated.

The audit_lessons.py script acts as the single source of truth for curriculum linting in the ai-engineering-from-scratch repository. It is designed to guarantee consistency across hundreds of lessons by catching structural deviations before they reach the main branch. According to the repository's source code, the script performs a multi-stage audit that surfaces issues with specific rule codes like L001 through L010.

How the Audit Is Triggered in CI

The script runs automatically inside the audit job defined in .github/workflows/curriculum.yml. This workflow invokes python scripts/audit_lessons.py on every push and pull request, ensuring that no lesson can be merged while violating the repository's structure. The requirement is also codified in AGENTS.md, which mandates the audit step as a repository-wide policy.

Command-Line Interface and Flags

Before scanning any directories, the script registers three optional flags through parser.add_argument calls (lines 45‑52). These flags control the scan scope, output format, and failure severity:

--phase — Restricts the scan to a single numeric phase (for example, --phase 12).
--json — Emits a structured JSON report instead of human-readable text.
--strict — Forces warnings to be treated as failures (currently equivalent to the default behavior).
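The three flags above could be declared roughly as follows. This is a minimal sketch, not the repository's code: the flag names come from the list above, while the build_parser helper, the types, the defaults, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: only the flag names are taken from the article.
    parser = argparse.ArgumentParser(prog="audit_lessons.py")
    parser.add_argument("--phase", type=int, default=None,
                        help="restrict the scan to one numeric phase")
    parser.add_argument("--json", action="store_true",
                        help="emit a structured JSON report")
    parser.add_argument("--strict", action="store_true",
                        help="treat warnings as failures")
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--phase", "12", "--json"])
print(args.phase, args.json, args.strict)  # 12 True False
```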

You can run the validator locally with the same commands used in CI:


# Run the audit on the whole repo (default human-readable output)
python scripts/audit_lessons.py

# Audit only phase 12 and output JSON for automated tooling
python scripts/audit_lessons.py --phase 12 --json > phase12_report.json

# Treat warnings as failures
python scripts/audit_lessons.py --strict

Lesson Discovery and Directory Naming

The iter_lesson_dirs function (lines 65‑82) discovers lessons by walking the phases/ directory tree. It yields every sub-directory that matches the NN-slug pattern. If --phase is supplied, only folders belonging to that numeric phase are visited.
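The discovery step described above might look something like the sketch below. It is not the repository's implementation: the function name iter_lesson_dirs_sketch, the two-level phases/<phase>/<lesson> layout, the two-digit phase prefix, and the loose "two digits then a hyphen" match are all assumptions.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, Optional

# Assumption: discovery uses a loose "NN-" prefix test on folder names.
NN_PREFIX = re.compile(r"^[0-9]{2}-")


def iter_lesson_dirs_sketch(phases_root: Path,
                            phase: Optional[int] = None) -> Iterator[Path]:
    for phase_dir in sorted(p for p in phases_root.iterdir() if p.is_dir()):
        # Assumption: phase folders start with the zero-padded phase number.
        if phase is not None and not phase_dir.name.startswith(f"{phase:02d}-"):
            continue
        for lesson_dir in sorted(p for p in phase_dir.iterdir() if p.is_dir()):
            if NN_PREFIX.match(lesson_dir.name):
                yield lesson_dir


# Build a throwaway tree with hypothetical folder names and scan phase 12 only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "phases"
    for rel in ("01-intro/01-setup", "12-agents/01-tools", "12-agents/notes"):
        (root / rel).mkdir(parents=True)
    found = [p.relative_to(root).as_posix()
             for p in iter_lesson_dirs_sketch(root, phase=12)]

print(found)  # ['12-agents/01-tools']
```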

Each folder name must satisfy the regular expression ^[0-9]{2}-[a-z0-9][a-z0-9-]*[a-z0-9]$. The check_lesson_dir_pattern validator (lines 85‑94) enforces this rule, reporting violations as L001.
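To see what the pattern accepts, it can be tried against a few candidate names. The regular expression is the one quoted above. The helper name is_valid_lesson_dir and the sample folder names are hypothetical.

```python
import re

LESSON_DIR_RE = re.compile(r"^[0-9]{2}-[a-z0-9][a-z0-9-]*[a-z0-9]$")


def is_valid_lesson_dir(name: str) -> bool:
    # Hypothetical helper; a False result corresponds to an L001 violation.
    return LESSON_DIR_RE.match(name) is not None


samples = ["01-intro", "07-gradient-descent", "7-sgd",
           "03-Linear-Algebra", "04-trailing-", "05-a"]
results = {name: is_valid_lesson_dir(name) for name in samples}
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'L001'}")
```

Two details follow from the pattern itself: the slug after the hyphen must be at least two characters long, and it can neither start nor end with a hyphen.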

Documentation Checks for docs/en.md

Every lesson must contain a docs/en.md file. The check_docs_en_md function (lines 97‑116) applies four separate checks:

Existence — Missing files raise L002.
UTF-8 validity — Files that cannot be decoded as UTF-8 also trigger L002.
Minimum size — The file must be at least 200 bytes; smaller files raise L003.
Top-level H1 — The document must start with a top-level heading; absence raises L004.

These rules guarantee that every lesson ships with a readable, properly formatted English explanation.
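The four checks can be sketched as a single function that returns rule codes. This is an illustration under stated assumptions, not the repository's check_docs_en_md: the function name, the messages, the choice to stop after an L002, and the reading of "top-level H1" as "the first non-blank line starts with '# '" are all guesses.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

MIN_DOC_BYTES = 200  # threshold stated in the article


def check_docs_sketch(lesson_dir: Path) -> List[Tuple[str, str]]:
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        return [("L002", "docs/en.md is missing")]
    raw = doc.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return [("L002", "docs/en.md is not valid UTF-8")]
    issues: List[Tuple[str, str]] = []
    if len(raw) < MIN_DOC_BYTES:
        issues.append(("L003", f"docs/en.md is only {len(raw)} bytes"))
    # Assumption: the H1 must be the first non-blank line.
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    if not first.startswith("# "):
        issues.append(("L004", "docs/en.md has no top-level H1"))
    return issues


with tempfile.TemporaryDirectory() as tmp:
    lesson = Path(tmp) / "01-intro"  # hypothetical lesson folder
    (lesson / "docs").mkdir(parents=True)
    missing = [code for code, _ in check_docs_sketch(Path(tmp) / "02-absent")]
    (lesson / "docs" / "en.md").write_text("Intro\n", encoding="utf-8")
    short = [code for code, _ in check_docs_sketch(lesson)]
    (lesson / "docs" / "en.md").write_text("# Intro\n\n" + "word " * 50,
                                           encoding="utf-8")
    ok = check_docs_sketch(lesson)

print(missing, short, ok)  # ['L002'] ['L003', 'L004'] []
```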

Code Directory Sanity

The check_code_main function (lines 119‑127) inspects the code/ folder inside each lesson. It confirms that the directory contains at least one non-ignored file. Files such as README.md and .gitkeep are explicitly ignored. If the directory is effectively empty, the script raises L005, preventing lessons from being published without source or configuration files.
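A rough sketch of the "effectively empty" test follows. The function name, the recursive walk, and the treatment of a missing code/ folder as empty are assumptions. The ignore set contains only the two file names mentioned above, and the real list may be longer.

```python
import tempfile
from pathlib import Path

# Only the two names mentioned in the article; the real list may differ.
IGNORED_NAMES = {"README.md", ".gitkeep"}


def code_dir_is_effectively_empty(code_dir: Path) -> bool:
    # Assumption: a missing code/ folder counts as empty too.
    if not code_dir.is_dir():
        return True
    return not any(p.is_file() and p.name not in IGNORED_NAMES
                   for p in code_dir.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    (code_dir / ".gitkeep").write_text("", encoding="utf-8")
    only_placeholder = code_dir_is_effectively_empty(code_dir)  # would be L005
    (code_dir / "main.py").write_text("print('hello')\n", encoding="utf-8")
    with_source = code_dir_is_effectively_empty(code_dir)

print(only_placeholder, with_source)  # True False
```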

Quiz Schema Validation

Quizzes are stored as quiz.json and validated by the check_quiz logic in scripts/audit_lessons.py. This is the most detailed check in the script and produces several distinct error codes:

L006 — The file must be valid JSON and, if it uses a dictionary wrapper, must contain a questions array. All question objects must include the required keys: stage, question, options, correct, and explanation.
L007 — Legacy keys such as q, choices, and answer are rejected.
L008 — The options array must contain between 2 and 6 items.
L009 — The correct field must be a valid integer index into the options array.

By enforcing this schema, the script ensures that every quiz is machine-parseable and learner-ready.
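The quiz rules above can be approximated in a few lines. The sketch below is not the repository's check_quiz: the function name, the order in which codes are reported, the zero-based reading of the correct index, and the decision to stop checking a question once required keys are missing are all assumptions.

```python
import json
from typing import List

REQUIRED_KEYS = ("stage", "question", "options", "correct", "explanation")
LEGACY_KEYS = ("q", "choices", "answer")


def check_quiz_sketch(raw: str) -> List[str]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["L006"]
    # A dictionary wrapper must carry a "questions" array; a bare list is used as-is.
    questions = data.get("questions") if isinstance(data, dict) else data
    if not isinstance(questions, list):
        return ["L006"]
    codes: List[str] = []
    for q in questions:
        if not isinstance(q, dict):
            codes.append("L006")
            continue
        if any(key in q for key in LEGACY_KEYS):
            codes.append("L007")
        if any(key not in q for key in REQUIRED_KEYS):
            codes.append("L006")
            continue
        options = q["options"]
        if not isinstance(options, list) or not 2 <= len(options) <= 6:
            codes.append("L008")
            continue
        correct = q["correct"]
        # Assumption: zero-based index; bool is excluded although it subclasses int.
        if (isinstance(correct, bool) or not isinstance(correct, int)
                or not 0 <= correct < len(options)):
            codes.append("L009")
    return codes


good = json.dumps({"questions": [{
    "stage": "pre", "question": "2 + 2?", "options": ["3", "4"],
    "correct": 1, "explanation": "Basic arithmetic."}]})
legacy = json.dumps([{"q": "2 + 2?", "choices": ["3", "4"], "answer": 1}])
bad_index = json.dumps([{
    "stage": "post", "question": "2 + 2?", "options": ["3", "4"],
    "correct": 5, "explanation": "Index 5 is out of range."}])

print(check_quiz_sketch(good))       # []
print(check_quiz_sketch(legacy))     # ['L007', 'L006']
print(check_quiz_sketch(bad_index))  # ['L009']
```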

Internal Markdown Link Verification

The check_internal_links function resolves every Markdown link found in docs/en.md. Links that are not absolute URLs are resolved relative to the repository root. If a referenced file does not exist on disk, the script emits L010. This catches broken cross-references before they reach learners.
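One way to picture the link check is the sketch below. It follows the description above in resolving targets against the repository root. The regular expression, the list of prefixes treated as external, the fragment stripping, and the function name are assumptions. It is deliberately simplified, so links with titles or nested brackets are not handled.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Simplified inline-link matcher: [text](target) with no title or spaces.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Assumption: these prefixes mark targets that are not files in the repo.
EXTERNAL_PREFIXES = ("http://", "https://", "mailto:", "#")


def broken_internal_links(markdown: str, repo_root: Path) -> List[str]:
    broken: List[str] = []
    for target in LINK_RE.findall(markdown):
        if target.startswith(EXTERNAL_PREFIXES):
            continue
        path_part = target.split("#", 1)[0]  # drop any #fragment
        if not (repo_root / path_part).exists():
            broken.append(target)  # would surface as L010
    return broken


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("# Demo\n", encoding="utf-8")
    text = ("See [readme](README.md), [site](https://example.com), "
            "and [gone](docs/missing.md#intro).")
    broken = broken_internal_links(text, repo)

print(broken)  # ['docs/missing.md#intro']
```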

Issue Aggregation and Exit Behavior

All failures are stored as Issue objects inside the Audit dataclass (lines 54‑58). Each issue records the rule ID, the lesson path, the file involved, and a descriptive message.

At the end of the run, render_report (lines 25‑42) formats the results, while the main entry point (lines 44‑73) orchestrates overall execution. In the default mode, the script prints a human-readable summary listing every violation grouped by rule. With --json, it returns a structured payload suitable for downstream automation. The script exits with code 1 if any issue is found and 0 otherwise, making it safe to use as a CI gate.

Typical output looks like this:

audit_lessons.py — 435 lesson(s) checked, 2 issue(s)

  [L005] phases/03-linear-algebra/lesson-02-matrix-multiplication/code: code/ is empty (no source or config files)
  [L010] phases/07-optimizers/lesson-04-sgd/docs/en.md: internal link does not resolve: '../nonexistent.md'

Summary by rule:
  L005: 1
  L010: 1
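The aggregation and exit-code behavior described in this section could be modeled roughly as follows. The class names Issue and Audit come from the article. The field names, the helper methods, and the sample paths are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Issue:
    # Field names are assumed; the article lists rule ID, lesson path, file, message.
    rule: str
    lesson: str
    file: str
    message: str


@dataclass
class Audit:
    lessons_checked: int = 0
    issues: List[Issue] = field(default_factory=list)

    def add(self, rule: str, lesson: str, file: str, message: str) -> None:
        self.issues.append(Issue(rule, lesson, file, message))

    def summary_by_rule(self) -> Dict[str, int]:
        return dict(sorted(Counter(i.rule for i in self.issues).items()))

    def exit_code(self) -> int:
        # Any recorded issue fails the run, which is what lets CI gate on it.
        return 1 if self.issues else 0


audit = Audit(lessons_checked=2)
audit.add("L005", "phases/03-linear-algebra/02-matmul", "code", "code/ is empty")
audit.add("L010", "phases/07-optimizers/04-sgd", "docs/en.md",
          "internal link does not resolve")

summary = audit.summary_by_rule()
print(summary, audit.exit_code())  # {'L005': 1, 'L010': 1} 1
```

A real entry point would typically pass the result of exit_code() to sys.exit so the CI job fails when issues exist.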

Summary

The audit_lessons.py script is a CI gate, invoked by the audit job in .github/workflows/curriculum.yml, that validates every lesson in rohitg00/ai-engineering-from-scratch.
It discovers lessons via iter_lesson_dirs, enforces folder naming with check_lesson_dir_pattern, and validates docs/en.md, code/, quiz.json, and internal Markdown links.
Specific rule codes (L001–L010) identify exactly which structural invariant failed.
The script exits with code 1 when any issue is detected, preventing non-compliant curriculum changes from merging.

Frequently Asked Questions

What does the audit_lessons.py script check in each lesson?

According to the rohitg00/ai-engineering-from-scratch source code, the script checks directory naming, the presence and quality of docs/en.md, non-empty code/ directories, a strict quiz.json schema, and valid internal Markdown links. It maps every failure to a machine-readable rule code such as L001 or L010 so contributors know exactly what to fix. These checks run automatically in CI via .github/workflows/curriculum.yml.

How do I run the audit for a single phase only?

Pass the --phase flag followed by the phase number. For example, python scripts/audit_lessons.py --phase 12 scans only phase 12 and skips the rest of the curriculum. This is useful for local debugging or for generating targeted JSON reports with --json.

What is the difference between --strict and the default behavior?

As implemented in the repository, --strict currently behaves the same as the default mode. Both configurations treat audit failures as fatal and exit with code 1 if any rule violation is found. Future iterations of the script may differentiate the two modes.

Where is the audit script triggered in CI?

The script is triggered by the audit job inside .github/workflows/curriculum.yml. It runs automatically on every push and pull request. Its use is required by the repository policy documented in AGENTS.md.
