How `audit_lessons.py` Validates `docs/en.md` Files: 4 Automated Checks Explained

audit_lessons.py enforces four strict checks on every docs/en.md file—existence and UTF-8 readability, minimum size, a top-level H1 heading, and valid internal links—to ensure lesson documentation is complete and consistent.

In the rohitg00/ai-engineering-from-scratch curriculum, each lesson must provide English documentation at docs/en.md. While AGENTS.md defines the repository-wide documentation guidelines, the mechanical enforcement lives in scripts/audit_lessons.py, which scans every lesson directory to guarantee these files are present, well-formed, and internally consistent.

Core Validation Rules for docs/en.md

The checks are implemented in two functions, check_docs_en_md and check_internal_links. Each failed condition produces a machine-readable issue object containing a rule code, lesson path, file path, and diagnostic message.
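To make the shape of an issue concrete, here is a minimal sketch of such a record. The four field names follow the JSON keys in this article's sample report, but the Issue dataclass and the format_issue helper are hypothetical names introduced for illustration, not the script's actual definitions.

```python
from dataclasses import asdict, dataclass
import json


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue record; fields follow the JSON keys in the sample report."""
    rule: str     # rule code such as "L004"
    lesson: str   # lesson directory, relative to the repository root
    file: str     # file that failed the check
    message: str  # human-readable diagnostic


def format_issue(issue: Issue) -> str:
    """Render an issue in the bracketed console style used in the sample report."""
    return f"[{issue.rule}] {issue.file}: {issue.message}"


issue = Issue(
    rule="L004",
    lesson="phases/01-introduction/01-hello-world",
    file="phases/01-introduction/01-hello-world/docs/en.md",
    message="docs/en.md missing top-level H1",
)
print(format_issue(issue))
print(json.dumps(asdict(issue), indent=2))
```

Keeping the record flat, with string fields only, is what would make it trivially serializable with json.dumps.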

L002 — File Existence and UTF-8 Encoding

The script first verifies that docs/en.md exists inside the lesson directory. It then attempts to read the file as UTF-8 text, wrapping the operation in a try/except UnicodeDecodeError block.


# Abridged from scripts/audit_lessons.py

if not doc.is_file():
    ...  # Issue L002 recorded

If the file is missing or cannot be decoded as UTF-8, audit_lessons.py logs an L002 issue.
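The existence check and the guarded UTF-8 read described above can be sketched as follows. The read_doc helper and its return convention are assumptions made for illustration; the real script records an issue object rather than returning a reason string.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def read_doc(lesson_dir: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success, or (None, reason) when L002 would apply.

    Hypothetical helper, not the script's actual function.
    """
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        return None, "docs/en.md is missing"
    try:
        return doc.read_text(encoding="utf-8"), None
    except UnicodeDecodeError:
        return None, "docs/en.md is not valid UTF-8"


with tempfile.TemporaryDirectory() as tmp:
    lesson = Path(tmp)
    missing = read_doc(lesson)  # docs/en.md does not exist yet

    (lesson / "docs").mkdir()
    (lesson / "docs" / "en.md").write_bytes(b"\xff\xfe not utf-8")
    undecodable = read_doc(lesson)  # byte 0xFF never occurs in valid UTF-8

    (lesson / "docs" / "en.md").write_text("# Title\n", encoding="utf-8")
    ok = read_doc(lesson)

print(missing, undecodable, ok, sep="\n")
```

Passing encoding="utf-8" explicitly matters here: without it, read_text falls back to the platform's default encoding, and the same file could pass on one machine and fail on another.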

L003 — Minimum Document Size

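Every docs/en.md must be at least 200 bytes long when encoded as UTF-8. This threshold is defined by the constant MIN_DOC_BYTES = 200. Because the check counts encoded bytes rather than characters, non-ASCII characters contribute more than one byte each.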


# Abridged from scripts/audit_lessons.py

if len(text.encode("utf-8")) < MIN_DOC_BYTES:
    ...  # Issue L003 recorded

Files that fall below this size trigger an L003 diagnostic.
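The difference between counting bytes and counting characters is easy to see in a small sketch. The is_too_small helper is a hypothetical name; only MIN_DOC_BYTES and the comparison itself come from the excerpt above.

```python
MIN_DOC_BYTES = 200  # value quoted in this article


def is_too_small(text: str) -> bool:
    """True when the UTF-8 encoding of text is shorter than MIN_DOC_BYTES."""
    return len(text.encode("utf-8")) < MIN_DOC_BYTES


ascii_doc = "# Title\n" + "a" * 150          # 158 characters, 158 bytes
accented_doc = "# Title\n" + "\u00e9" * 100  # 108 characters, 208 bytes

print(is_too_small(ascii_doc))     # True: 158 < 200
print(is_too_small(accented_doc))  # False: each U+00E9 takes 2 bytes in UTF-8
```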

L004 — Required Top-Level Heading

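The Markdown must contain a level-1 heading: a line that starts with a single #, followed by whitespace and at least one non-whitespace character. Because the pattern is compiled with re.MULTILINE, the heading may appear on any line, not only the first. The script uses a compiled regular expression to confirm the presence of an H1: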


# Abridged from scripts/audit_lessons.py

import re

H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)

if not H1_RE.search(text):
    ...  # Issue L004 recorded

Documents without a top-level heading generate an L004 issue.
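To see which inputs the pattern accepts, the sketch below runs the quoted regex against a few sample documents. The sample strings are invented for illustration. Note that the regex does not parse Markdown, so a line such as a Python comment inside a code fence also satisfies it.

```python
import re

H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)  # pattern quoted in this article

samples = {
    "h1_first_line": "# Lesson title\n\nBody text.",
    "h1_later_line": "Intro paragraph.\n\n# Lesson title\n",
    "h2_only": "## Section\n\nBody text.",
    "no_space": "#Lesson title\n",
    "indented": "  # Lesson title\n",
    "comment_in_fence": "```python\n# a comment\n```\n",
}

results = {name: bool(H1_RE.search(text)) for name, text in samples.items()}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Only the two genuine H1 samples and the fenced comment match; the H2, the heading without a space, and the indented heading do not.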

L010 — Resolvable Internal Links

The check_internal_links function iterates over every match of MD_LINK_RE to find relative Markdown links. It resolves each matched path to an absolute Path, then verifies that the target exists inside the repository.


# Conceptual flow from scripts/audit_lessons.py

for match in MD_LINK_RE.finditer(text):
    target = resolved_path / match.group("path")
    if not target.exists():
        ...  # Issue L010 recorded

Broken relative links result in an L010 failure.
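A runnable version of this flow might look like the sketch below. The article does not show the real MD_LINK_RE, so the pattern here is an assumption: it captures the target of [text](target) in a group named path and skips URLs with a scheme and pure #anchor links. The broken_links helper, and the choice to resolve links relative to the document's own directory, are likewise hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Hypothetical pattern; the script's actual MD_LINK_RE may differ.
MD_LINK_RE = re.compile(
    r"\[[^\]]*\]\((?!#|[a-zA-Z][a-zA-Z0-9+.-]*:)(?P<path>[^)\s#]+)"
)


def broken_links(doc: Path) -> List[str]:
    """Return the relative link targets in doc that do not exist on disk."""
    text = doc.read_text(encoding="utf-8")
    missing = []
    for match in MD_LINK_RE.finditer(text):
        target = doc.parent / match.group("path")  # relative to the doc's folder
        if not target.exists():
            missing.append(match.group("path"))
    return missing


with tempfile.TemporaryDirectory() as tmp:
    lesson = Path(tmp) / "lesson"
    (lesson / "docs").mkdir(parents=True)
    (lesson / "code").mkdir()
    (lesson / "code" / "main.py").write_text("print('hi')\n", encoding="utf-8")
    doc = lesson / "docs" / "en.md"
    doc.write_text(
        "# Title\n\n"
        "[ok](../code/main.py) [bad](../03-models) "
        "[web](https://example.com) [anchor](#top)\n",
        encoding="utf-8",
    )
    found = broken_links(doc)

print(found)
```

In this sketch only ../03-models is reported: the link to ../code/main.py resolves, and the web URL and the anchor are never treated as file paths.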

Running the Auditor

Invoke the script from the repository root. By default, it scans every phase.


# Audit all phases

python scripts/audit_lessons.py

Restrict the scan to a single phase with the --phase flag:


# Audit only phase 01

python scripts/audit_lessons.py --phase 01

For CI pipelines or automated reporting, output the results as JSON:


# JSON-formatted report

python scripts/audit_lessons.py --json

Interpreting Audit Results

When issues are detected, audit_lessons.py prints a summary followed by individual diagnostics. A typical console report looks like this:

audit_lessons.py — 120 lesson(s) checked, 8 issue(s)

  [L004] phases/01-introduction/01-hello-world/docs/en.md: docs/en.md missing top-level H1
  [L010] phases/02-basics/02-data-structures/docs/en.md: internal link does not resolve: "../03-models"

Summary by rule:
  L004: 4
  L010: 2
  L003: 1
  L002: 1

Each Issue contains the rule, lesson path, file path, and message. With the --json flag, the output is machine-readable:

$ python scripts/audit_lessons.py --json > report.json
$ cat report.json | jq '.issues[] | select(.rule=="L004")'
{
  "rule": "L004",
  "lesson": "phases/01-introduction/01-hello-world",
  "file": "phases/01-introduction/01-hello-world/docs/en.md",
  "message": "docs/en.md missing top-level H1"
}
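The same report can be consumed from Python instead of jq. The sketch below assumes only what the example above shows: a top-level "issues" array whose entries carry rule, lesson, file, and message. The sample payload and the summarize and files_for_rule helpers are invented for illustration. In practice the text would come from report.json.

```python
import json
from collections import Counter
from typing import List

# Sample payload shaped like the report above (illustrative data only).
report_text = r"""
{
  "issues": [
    {"rule": "L004",
     "lesson": "phases/01-introduction/01-hello-world",
     "file": "phases/01-introduction/01-hello-world/docs/en.md",
     "message": "docs/en.md missing top-level H1"},
    {"rule": "L010",
     "lesson": "phases/02-basics/02-data-structures",
     "file": "phases/02-basics/02-data-structures/docs/en.md",
     "message": "internal link does not resolve: \"../03-models\""}
  ]
}
"""


def summarize(report: dict) -> Counter:
    """Count issues per rule code."""
    return Counter(issue["rule"] for issue in report["issues"])


def files_for_rule(report: dict, rule: str) -> List[str]:
    """Return the sorted, de-duplicated files that violate the given rule."""
    return sorted({i["file"] for i in report["issues"] if i["rule"] == rule})


report = json.loads(report_text)
print(summarize(report))
print(files_for_rule(report, "L004"))
```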

Summary

  • audit_lessons.py mechanically verifies every lesson's docs/en.md in the rohitg00/ai-engineering-from-scratch repository.
  • The four automated checks cover file existence and UTF-8 encoding (L002), a minimum size of 200 bytes (L003), a top-level H1 heading (L004), and resolvable internal links (L010).
  • The implementation lives in scripts/audit_lessons.py, in the check_docs_en_md and check_internal_links functions.
  • Execution supports full-repository scans, per-phase filtering with --phase, and JSON export with --json for CI integration.

Frequently Asked Questions

What happens if a docs/en.md file is missing?

If docs/en.md does not exist in the lesson directory, or cannot be read as UTF-8, the check_docs_en_md function records an L002 issue. The audit continues with the remaining lessons, and the missing or undecodable file appears in the report.

How are internal links validated?

The check_internal_links function applies MD_LINK_RE to extract relative links from the Markdown text. It builds an absolute Path for each target and calls target.exists(). Any link that fails to resolve to an existing file inside the repository produces an L010 issue.

Can I run the audit against a single phase only?

Yes. Use the --phase argument followed by the phase identifier. For example, running python scripts/audit_lessons.py --phase 01 will scan only phase 01, skipping all other directories in the curriculum.

Where are the validation thresholds defined?

The thresholds are declared directly in scripts/audit_lessons.py. The minimum document size is controlled by MIN_DOC_BYTES = 200, and the top-level heading requirement is enforced with the compiled regex H1_RE = re.compile(r"^#\s+\S", re.MULTILINE).
