# How the audit_lessons.py CI Validation Script Enforces Lesson Quality

> Learn how the audit_lessons.py CI validation script ensures lesson quality. Discover how it checks documentation, code, quizzes, and links in rohitg00/ai-engineering-from-scratch.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-07

---

**The [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) CI validation script walks the `phases/` tree in `rohitg00/ai-engineering-from-scratch`, validates every `NN-slug` lesson directory for documentation, code, quizzes, and internal links, and exits with code `1` if any structural invariant is violated.**

The [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) script acts as the single source of truth for curriculum linting in the **ai-engineering-from-scratch** repository. It keeps hundreds of lessons consistent by catching structural deviations before they reach the main branch. According to the repository's source code, the script runs a multi-stage audit and reports each issue under a rule code from **L001** through **L010**.

## How the Audit Is Triggered in CI

The script runs automatically inside the **`audit`** job defined in [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml). This workflow invokes `python scripts/audit_lessons.py` on every push and pull request, ensuring that no lesson can be merged while violating the repository's structure. The requirement is also codified in **[`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md)**, which mandates the audit step as a repository-wide policy.

## Command-Line Interface and Flags

Before scanning any directories, the script declares three optional flags through `parser.add_argument` calls (lines 45‑52). These flags control the scan scope, the output format, and how strictly warnings are handled:

- **`--phase`** — Restricts the scan to a single numeric phase (for example, `--phase 12`).
- **`--json`** — Emits a structured JSON report instead of human-readable text.
- **`--strict`** — Forces warnings to be treated as failures (currently equivalent to the default behavior).

You can run the validator locally with the same commands used in CI:

```bash
# Run the audit on the whole repo (default human-readable output)
python scripts/audit_lessons.py

# Audit only phase 12 and output JSON for automated tooling
python scripts/audit_lessons.py --phase 12 --json > phase12_report.json

# Treat warnings as failures
python scripts/audit_lessons.py --strict
```
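
The three flags described above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual code: the `build_parser` helper, the help strings, and the choice to parse `--phase` as an integer are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three optional flags (hypothetical helper and help text)."""
    parser = argparse.ArgumentParser(prog="audit_lessons.py")
    # Assumption: the phase number is parsed as an integer.
    parser.add_argument("--phase", type=int, default=None,
                        help="restrict the scan to one numeric phase")
    parser.add_argument("--json", action="store_true",
                        help="emit a JSON report instead of text")
    parser.add_argument("--strict", action="store_true",
                        help="treat warnings as failures")
    return parser


# Parse an explicit argument list instead of sys.argv so the demo is repeatable.
args = build_parser().parse_args(["--phase", "12", "--json"])
print(args.phase, args.json, args.strict)  # 12 True False
```

Because all three flags are optional, running the parser with no arguments leaves `phase` as `None` and both switches as `False`, which corresponds to a full, human-readable audit.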

## Lesson Discovery and Directory Naming

The `iter_lesson_dirs` function (lines 65‑82) discovers lessons by walking the `phases/` directory tree. It yields every sub-directory that matches the **`NN-slug`** pattern. If `--phase` is supplied, only folders belonging to that numeric phase are visited.

Each folder name must satisfy the regular expression `^[0-9]{2}-[a-z0-9][a-z0-9-]*[a-z0-9]$`. The `check_lesson_dir_pattern` validator (lines 85‑94) enforces this rule, reporting violations as **L001**.
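
The discovery and naming rules above can be sketched in a few lines. The regular expression is the one quoted in this section; everything else is an assumption for illustration, including the `phases/<phase>/<lesson>` layout, the two-digit phase prefix, and the helper names. This simplified walker yields every lesson-level directory and leaves name validation to a separate check, which may differ from how the real `iter_lesson_dirs` filters.

```python
import re
from pathlib import Path
from typing import Iterator, Optional

# Pattern quoted in the article for lesson folder names.
LESSON_DIR_RE = re.compile(r"^[0-9]{2}-[a-z0-9][a-z0-9-]*[a-z0-9]$")


def is_valid_lesson_name(name: str) -> bool:
    """True when the folder name matches the NN-slug pattern (L001 otherwise)."""
    return LESSON_DIR_RE.fullmatch(name) is not None


def iter_lesson_dirs(phases_root: Path, phase: Optional[int] = None) -> Iterator[Path]:
    """Yield lesson directories, optionally limited to one numeric phase."""
    for phase_dir in sorted(p for p in phases_root.iterdir() if p.is_dir()):
        # Assumption: phase folders begin with a zero-padded phase number.
        if phase is not None and not phase_dir.name.startswith(f"{phase:02d}-"):
            continue
        for lesson_dir in sorted(p for p in phase_dir.iterdir() if p.is_dir()):
            yield lesson_dir


print(is_valid_lesson_name("02-matrix-multiplication"))  # True
print(is_valid_lesson_name("lesson-02-matrix"))          # False
```

Note that the pattern requires the slug to start and end with a lowercase letter or digit, so names with uppercase letters or a trailing hyphen are rejected.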

## Documentation Checks for docs/en.md

Every lesson must contain a `docs/en.md` file. The `check_docs_en_md` function (lines 97‑116) applies four separate checks:

1. **Existence** — Missing files raise **L002**.
2. **UTF-8 validity** — Files that cannot be decoded as UTF-8 also trigger **L002**.
3. **Minimum size** — The file must be at least **200 bytes**; smaller files raise **L003**.
4. **Top-level H1** — The document must start with a top-level heading; absence raises **L004**.

These rules guarantee that every lesson ships with a readable, properly formatted English explanation.
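
The four checks listed above could be implemented along these lines. This is a sketch under stated assumptions rather than the repository's code: the function name `check_docs`, the `(rule, message)` return shape, the message wording, and the interpretation of "top-level heading" as a first non-blank line starting with `# ` are all illustrative choices.

```python
from pathlib import Path
from typing import List, Tuple

MIN_DOC_BYTES = 200  # minimum size stated in the article


def check_docs(lesson_dir: Path) -> List[Tuple[str, str]]:
    """Return (rule, message) pairs for one lesson's docs/en.md."""
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        return [("L002", "docs/en.md is missing")]
    raw = doc.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return [("L002", "docs/en.md is not valid UTF-8")]

    issues: List[Tuple[str, str]] = []
    if len(raw) < MIN_DOC_BYTES:
        issues.append(("L003", f"docs/en.md is {len(raw)} bytes (minimum {MIN_DOC_BYTES})"))
    # Assumption: the first non-blank line must be an H1 such as "# Title".
    first_line = next((line for line in text.splitlines() if line.strip()), "")
    if not first_line.startswith("# "):
        issues.append(("L004", "docs/en.md does not start with a top-level heading"))
    return issues
```

The size check measures raw bytes instead of decoded characters, matching the 200-byte threshold described above. The existence and encoding checks return early because the remaining checks cannot run without readable text.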

## Code Directory Sanity

The `check_code_main` function (lines 119‑127) inspects the `code/` folder inside each lesson and confirms that it contains at least one non-ignored file. Placeholder files such as `README.md` and `.gitkeep` are explicitly ignored. If the directory is effectively empty, the script raises **L005**, which prevents lessons from being published without source or configuration files.
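
A check of this kind might look like the sketch below. The ignored names come from this section; the function name `check_code_dir`, the recursive search, and the decision to treat a missing `code/` directory like an empty one are assumptions.

```python
from pathlib import Path
from typing import List, Tuple

# Names the article lists as ignored when deciding whether code/ is empty.
IGNORED_NAMES = {"README.md", ".gitkeep"}


def check_code_dir(lesson_dir: Path) -> List[Tuple[str, str]]:
    """Return an L005 issue when code/ holds no file other than ignored ones."""
    code_dir = lesson_dir / "code"
    # Assumption: a missing code/ directory counts the same as an empty one.
    if code_dir.is_dir():
        for path in code_dir.rglob("*"):
            if path.is_file() and path.name not in IGNORED_NAMES:
                return []
    return [("L005", "code/ is empty (no source or config files)")]
```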

## Quiz Schema Validation

Quizzes are stored as `quiz.json` and validated by the `check_quiz` logic in [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py). This is the most detailed check in the script, and it produces several distinct rule codes:

- **L006** — The file must be valid JSON and, if it uses a dictionary wrapper, must contain a `questions` array. All question objects must include the required keys: `stage`, `question`, `options`, `correct`, and `explanation`.
- **L007** — Legacy keys such as `q`, `choices`, and `answer` are rejected.
- **L008** — The `options` array must contain between **2 and 6** items.
- **L009** — The `correct` field must be a valid integer index into the `options` array.

By enforcing this schema, the script ensures that every quiz is machine-parseable and learner-ready.
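
The schema rules above translate naturally into a small validator. The sketch below is illustrative only: the required and legacy key names and the 2 to 6 option range come from this section, while the function name `check_quiz_text`, the message wording, the order of the checks, and the assumption that `correct` is a zero-based index are not confirmed by the source.

```python
import json
from typing import Any, List, Tuple

REQUIRED_KEYS = ("stage", "question", "options", "correct", "explanation")
LEGACY_KEYS = ("q", "choices", "answer")


def check_quiz_text(raw: str) -> List[Tuple[str, str]]:
    """Validate the text of a quiz.json file and return (rule, message) pairs."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [("L006", f"invalid JSON: {exc.msg}")]
    # Accept either a bare list or a {"questions": [...]} wrapper.
    questions = data.get("questions") if isinstance(data, dict) else data
    if not isinstance(questions, list):
        return [("L006", "expected a list of questions or a 'questions' array")]

    issues: List[Tuple[str, str]] = []
    for i, q in enumerate(questions):
        if not isinstance(q, dict):
            issues.append(("L006", f"question {i} is not an object"))
            continue
        legacy = [k for k in LEGACY_KEYS if k in q]
        if legacy:
            issues.append(("L007", f"question {i} uses legacy keys {legacy}"))
        missing = [k for k in REQUIRED_KEYS if k not in q]
        if missing:
            issues.append(("L006", f"question {i} is missing keys {missing}"))
            continue
        options, correct = q["options"], q["correct"]
        if not isinstance(options, list) or not 2 <= len(options) <= 6:
            issues.append(("L008", f"question {i} needs 2 to 6 options"))
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(correct, int) and not isinstance(correct, bool)
        # Assumption: 'correct' is a zero-based index into 'options'.
        if not is_int or not 0 <= correct < len(options):
            issues.append(("L009", f"question {i} has an invalid 'correct' index"))
    return issues
```

Checking the index only after the option count passes avoids reporting a misleading **L009** for a question whose `options` array is already malformed.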

## Internal Markdown Link Verification

The `check_internal_links` function resolves every Markdown link found in a lesson's `docs/en.md`. Links that are not absolute URLs are resolved relative to the repository root. If a referenced file does not exist on disk, the script emits **L010**. This catches broken cross-references before they reach learners.
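
As a rough illustration of this kind of link check, the sketch below extracts inline Markdown links with a regular expression and tests each local target against the filesystem. The regex, the list of skipped prefixes, the handling of `#fragment` suffixes, and the function name `find_broken_links` are all assumptions. The `base_dir` parameter stands in for whichever directory links are resolved against, which this article describes as the repository root.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches inline links and images of the form [text](target "optional title").
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# Assumption: these prefixes mark targets that are not files on disk.
SKIPPED_PREFIXES = ("http://", "https://", "mailto:", "#")


def find_broken_links(markdown: str, base_dir: Path) -> List[Tuple[str, str]]:
    """Return an L010 issue for each local link target missing under base_dir."""
    issues: List[Tuple[str, str]] = []
    for match in LINK_RE.finditer(markdown):
        target = match.group(1)
        if target.startswith(SKIPPED_PREFIXES):
            continue
        path_part = target.split("#", 1)[0]  # drop any anchor fragment
        if path_part and not (base_dir / path_part).exists():
            issues.append(("L010", f"internal link does not resolve: {target!r}"))
    return issues
```

A regex-based extractor like this one does not handle reference-style links or links inside fenced code blocks, so a production check would likely need a more careful parser.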

## Issue Aggregation and Exit Behavior

All failures are stored as **Issue** objects inside the `Audit` dataclass (lines 54‑58). Each issue records the rule ID, the lesson path, the file involved, and a descriptive message.

At the end of the run, `render_report` (lines 25‑42) formats the results, while the `main` entry point (lines 44‑73) orchestrates overall execution. In the default mode, the script prints a human-readable report that lists every violation and then summarizes the counts by rule. With `--json`, it prints a structured payload suitable for downstream automation. The script exits with code **1** if any issue is found and **0** otherwise, which makes it suitable as a CI gate.

Typical output looks like this:

```text
audit_lessons.py — 435 lesson(s) checked, 2 issue(s)

  [L005] phases/03-linear-algebra/02-matrix-multiplication/code: code/ is empty (no source or config files)
  [L010] phases/07-optimizers/04-sgd/docs/en.md: internal link does not resolve: '../nonexistent.md'

Summary by rule:
  L005: 1
  L010: 1
```
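
The aggregation and exit behavior described in this section can be modeled with two small dataclasses. This is a sketch under assumptions, not the script's real data model: the field names, the `add` and `exit_code` helpers, the JSON keys, and the simplified text layout are hypothetical.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Issue:
    rule: str      # rule ID such as "L005"
    lesson: str    # lesson path
    file: str      # file or directory involved
    message: str   # human-readable description


@dataclass
class Audit:
    lessons_checked: int = 0
    issues: List[Issue] = field(default_factory=list)

    def add(self, rule: str, lesson: str, file: str, message: str) -> None:
        self.issues.append(Issue(rule, lesson, file, message))

    def exit_code(self) -> int:
        """1 when any issue was recorded, 0 otherwise."""
        return 1 if self.issues else 0


def render_report(audit: Audit, as_json: bool = False) -> str:
    """Format the audit as JSON or as plain text with a per-rule summary."""
    by_rule = dict(sorted(Counter(i.rule for i in audit.issues).items()))
    if as_json:
        payload = {
            "lessons_checked": audit.lessons_checked,
            "issues": [asdict(i) for i in audit.issues],
            "summary": by_rule,
        }
        return json.dumps(payload, indent=2)
    lines = [f"{audit.lessons_checked} lesson(s) checked, {len(audit.issues)} issue(s)"]
    lines += [f"  [{i.rule}] {i.file}: {i.message}" for i in audit.issues]
    lines += [f"  {rule}: {count}" for rule, count in by_rule.items()]
    return "\n".join(lines)
```

An entry point built on this model would print `render_report(audit)` and then call `sys.exit(audit.exit_code())`, so a CI job fails whenever at least one issue was recorded.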

## Summary

- The [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) script is a CI gate, invoked from [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml), that validates every lesson in `rohitg00/ai-engineering-from-scratch`.
- It discovers lessons via `iter_lesson_dirs`, enforces folder naming with `check_lesson_dir_pattern`, and validates each lesson's `docs/en.md`, `code/`, and `quiz.json`, along with the internal links in its documentation.
- Specific rule codes (**L001**–**L010**) identify exactly which structural invariant failed.
- The script exits with code `1` when any issue is detected, preventing non-compliant curriculum changes from merging.

## Frequently Asked Questions

### What does the audit_lessons.py script check in each lesson?

According to the `rohitg00/ai-engineering-from-scratch` source code, the script checks directory naming, the presence and quality of [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md), non-empty `code/` directories, a strict [`quiz.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/quiz.json) schema, and valid internal Markdown links. It maps every failure to a machine-readable rule code such as **L001** or **L010** so contributors know exactly what to fix. These checks run automatically in CI via [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml).

### How do I run the audit for a single phase only?

Pass the `--phase` flag followed by the phase number. For example, `python scripts/audit_lessons.py --phase 12` scans only phase 12 and skips the rest of the curriculum. This is useful for local debugging or for generating targeted JSON reports with `--json`.

### What is the difference between `--strict` and the default behavior?

As implemented in the repository, `--strict` currently behaves the same as the default mode. Both modes treat audit failures as fatal and exit with code `1` if any rule violation is found. Future iterations of the script may differentiate the two modes more granularly.

### Where is the audit script triggered in CI?

The script is triggered by the **`audit`** job inside [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml). It runs automatically on every push and pull request. Its use is required by the repository policy documented in [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md).