How `audit_lessons.py` Validates `docs/en.md` Files: 4 Automated Checks Explained
audit_lessons.py enforces four strict checks on every docs/en.md file (existence and UTF-8 readability, minimum size, a top-level H1 heading, and valid internal links) to ensure lesson documentation is complete and consistent.
In the rohitg00/ai-engineering-from-scratch curriculum, each lesson must provide English documentation at docs/en.md. While AGENTS.md defines the repository-wide documentation guidelines, the mechanical enforcement lives in scripts/audit_lessons.py, which scans every lesson directory to guarantee these files are present, well-formed, and internally consistent.
Core Validation Rules for docs/en.md
The checks are implemented in two functions in scripts/audit_lessons.py: check_docs_en_md and check_internal_links. Each failed condition produces a machine-readable issue object containing a rule code, lesson path, file path, and diagnostic message.
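To make the shape of an issue concrete, here is a minimal sketch of such a record. The class name `Issue` and its field names are assumptions chosen to mirror the JSON keys in the sample report later in this article; the actual definition in scripts/audit_lessons.py may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue record; field names mirror the article's JSON sample."""
    rule: str     # rule code such as "L004"
    lesson: str   # lesson directory, relative to the repository root
    file: str     # file the issue was found in
    message: str  # human-readable diagnostic


issue = Issue(
    rule="L004",
    lesson="phases/01-introduction/01-hello-world",
    file="phases/01-introduction/01-hello-world/docs/en.md",
    message="docs/en.md missing top-level H1",
)

# asdict() turns the record into a plain dict that json can serialize.
print(json.dumps(asdict(issue), indent=2))
```

Keeping the record flat, with four string fields, is what makes it easy to filter by rule in a console report or with a tool like jq.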
L002 — File Existence and UTF-8 Encoding
The script first verifies that docs/en.md exists inside the lesson directory. It then attempts to read the file as UTF-8 text, wrapping the operation in a try/except UnicodeDecodeError block.
# From scripts/audit_lessons.py
if not doc.is_file():
    # Issue L002 recorded
If the file is missing or cannot be decoded as UTF-8, audit_lessons.py logs an L002 issue.
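The two-step check described above (existence first, then a guarded UTF-8 read) can be sketched as follows. The helper name `read_doc_utf8`, its return shape, and the error strings are hypothetical and only illustrate the pattern; they are not taken from the script.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_doc_utf8(lesson_dir: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, error); exactly one of the two is None.

    Hypothetical helper illustrating the L002 pattern.
    """
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        return None, "docs/en.md is missing"
    try:
        # An explicit encoding avoids depending on the platform default.
        return doc.read_text(encoding="utf-8"), None
    except UnicodeDecodeError:
        return None, "docs/en.md is not valid UTF-8"
```

Because both failure modes map to the same rule, the later size and heading checks only ever run on text that was successfully decoded.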
L003 — Minimum Document Size
Every docs/en.md must contain at least 200 bytes of content. This threshold is defined by the constant MIN_DOC_BYTES = 200.
# From scripts/audit_lessons.py
if len(text.encode("utf-8")) < MIN_DOC_BYTES:
    # Issue L003 recorded
Files that fall below this size trigger an L003 diagnostic.
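Note that the comparison is made on encoded bytes, not characters, so non-ASCII text reaches the threshold with fewer characters. The short sketch below illustrates this; the helper name `is_too_short` is hypothetical, while the constant and the comparison are the ones quoted above.

```python
MIN_DOC_BYTES = 200  # value quoted in the article


def is_too_short(text: str, min_bytes: int = MIN_DOC_BYTES) -> bool:
    """Hypothetical wrapper around the byte-length comparison shown above."""
    return len(text.encode("utf-8")) < min_bytes


ascii_doc = "a" * 199          # 199 characters, 199 bytes
accented_doc = "\u00e9" * 100  # 100 characters, 2 bytes each in UTF-8

print(is_too_short(ascii_doc))     # True: one byte under the threshold
print(is_too_short(accented_doc))  # False: exactly 200 bytes
```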
L004 — Required Top-Level Heading
The Markdown must contain a level-1 heading. The script uses a compiled regular expression to confirm the presence of an H1:
# From scripts/audit_lessons.py
H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)
if not H1_RE.search(text):
    # Issue L004 recorded
Documents without a top-level heading generate an L004 issue.
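To see what this pattern accepts and rejects, the sketch below runs it against a few small inputs. The `has_h1` wrapper is a hypothetical convenience; the regular expression is the one quoted above.

```python
import re

H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)  # pattern quoted in the article


def has_h1(text: str) -> bool:
    """Hypothetical wrapper: True if any line looks like '# Title'."""
    return H1_RE.search(text) is not None


print(has_h1("# Title\n"))               # True
print(has_h1("Intro\n\n# Title\n"))      # True: MULTILINE lets ^ match any line start
print(has_h1("## Section\n"))            # False: the second '#' is not whitespace
print(has_h1("#Title\n"))                # False: no space after '#'
print(has_h1("```\n# comment\n```\n"))   # True: the regex is not fence-aware
```

The last case is worth keeping in mind: because the check is purely line-based, a `# comment` line inside a fenced code block satisfies it even when the document has no real heading.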
L010 — Internal Links Must Resolve
The check_internal_links function walks every match of MD_LINK_RE to find relative Markdown links. It converts each matched path to an absolute Path, then verifies the target file exists inside the repository.
# Conceptual flow from scripts/audit_lessons.py
for match in MD_LINK_RE.finditer(text):
    target = resolved_path / match.group("path")
    if not target.exists():
        # Issue L010 recorded
Broken relative links result in an L010 failure.
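The conceptual flow can be turned into a self-contained sketch under a few stated assumptions. The article does not show the real `MD_LINK_RE`, so the pattern below is a hypothetical stand-in that exposes a `path` group; resolving links relative to the document's own directory and skipping URLs with a scheme are likewise assumptions, not confirmed behavior of the script.

```python
import re
from pathlib import Path
from typing import List

# Hypothetical stand-in: captures the link target up to any '#fragment'.
MD_LINK_RE = re.compile(r"\[[^\]]*\]\((?P<path>[^)\s#]+)[^)]*\)")
# Matches targets that start with a URL scheme such as 'https:' or 'mailto:'.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def broken_links(doc: Path, text: str) -> List[str]:
    """Return relative link targets in `text` that do not exist on disk."""
    missing = []
    for match in MD_LINK_RE.finditer(text):
        raw = match.group("path")
        if SCHEME_RE.match(raw):
            continue  # external link, not checked here
        # Assumption: links are resolved against the document's directory.
        target = (doc.parent / raw).resolve()
        if not target.exists():
            missing.append(raw)
    return missing
```

With this stand-in pattern, pure in-page anchors such as `[top](#intro)` never match, so only file targets are tested for existence.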
Running the Auditor
Invoke the script from the repository root. By default, it scans every phase.
# Audit all phases
python scripts/audit_lessons.py
Restrict the scan to a single phase with the --phase flag:
# Audit only phase 01
python scripts/audit_lessons.py --phase 01
For CI pipelines or automated reporting, output the results as JSON:
# JSON-formatted report
python scripts/audit_lessons.py --json
Interpreting Audit Results
When issues are detected, audit_lessons.py prints a summary followed by individual diagnostics. A typical console report looks like this:
audit_lessons.py — 120 lesson(s) checked, 8 issue(s)
[L004] phases/01-introduction/01-hello-world/docs/en.md: docs/en.md missing top-level H1
[L010] phases/02-basics/02-data-structures/docs/en.md: internal link does not resolve: "../03-models"
Summary by rule:
L004: 4
L010: 2
L003: 1
L002: 1
Each Issue contains the rule, lesson path, file path, and message. With the --json flag, the output is machine-readable:
$ python scripts/audit_lessons.py --json > report.json
$ cat report.json | jq '.issues[] | select(.rule=="L004")'
{
  "rule": "L004",
  "lesson": "phases/01-introduction/01-hello-world",
  "file": "phases/01-introduction/01-hello-world/docs/en.md",
  "message": "docs/en.md missing top-level H1"
}
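For pipelines that prefer Python over jq, the same report can be summarized in a few lines. This is a sketch that assumes the report has a top-level `issues` array whose entries carry a `rule` key, as the jq filter and sample object above suggest; the function name `count_by_rule` is hypothetical.

```python
import json
from collections import Counter


def count_by_rule(report_json: str) -> Counter:
    """Count issues per rule code in a JSON report string."""
    report = json.loads(report_json)
    # Assumption: issues live under a top-level "issues" key.
    return Counter(issue["rule"] for issue in report["issues"])


sample = json.dumps({
    "issues": [
        {"rule": "L004", "lesson": "a", "file": "a/docs/en.md", "message": "m"},
        {"rule": "L010", "lesson": "b", "file": "b/docs/en.md", "message": "m"},
        {"rule": "L004", "lesson": "c", "file": "c/docs/en.md", "message": "m"},
    ]
})

for rule, count in count_by_rule(sample).most_common():
    print(f"{rule}: {count}")
```

In practice the input would come from reading report.json rather than from an inline sample.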
Summary
- audit_lessons.py mechanically verifies every lesson's docs/en.md in the rohitg00/ai-engineering-from-scratch repository.
- The four automated checks cover file existence and UTF-8 encoding (L002), a minimum size of 200 bytes (L003), a top-level H1 heading (L004), and resolvable internal links (L010).
- Key implementation files include scripts/audit_lessons.py with the check_docs_en_md and check_internal_links functions.
- Execution supports full-repository scans, per-phase filtering with --phase, and JSON export with --json for CI integration.
Frequently Asked Questions
What happens if a docs/en.md file is missing?
If the file does not exist at lesson/docs/en.md or cannot be read as UTF-8, the check_docs_en_md function records an L002 issue. The audit continues with the remaining lessons, but the missing or corrupt documentation is flagged for immediate correction.
How does audit_lessons.py validate internal links?
The check_internal_links function applies MD_LINK_RE to extract relative links from the Markdown text. It builds an absolute Path for each target and calls target.exists(). Any link that fails to resolve to an existing file inside the repository produces an L010 error.
Can I run the audit against a single phase only?
Yes. Use the --phase argument followed by the phase identifier. For example, running python scripts/audit_lessons.py --phase 01 will scan only phase 01, skipping all other directories in the curriculum.
Where are the validation thresholds defined?
The thresholds are declared directly in scripts/audit_lessons.py. The minimum document size is controlled by MIN_DOC_BYTES = 200, and the top-level heading requirement is enforced with the compiled regex H1_RE = re.compile(r"^#\s+\S", re.MULTILINE).