# How `audit_lessons.py` Validates `docs/en.md` Files: 4 Automated Checks Explained

> Discover the 4 automated validation checks in `audit_lessons.py` that ensure `docs/en.md` files are complete and consistent. Learn about existence, size, H1 headings, and internal links.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-09

---

**[`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) enforces four checks on every lesson's [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file: existence and UTF-8 readability, minimum size, a top-level H1 heading, and valid internal links. Together they keep lesson documentation complete and consistent.**

In the `rohitg00/ai-engineering-from-scratch` curriculum, each lesson must provide English documentation at [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md). While [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md) defines the repository-wide documentation guidelines, the mechanical enforcement lives in [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py), which scans every lesson directory to guarantee these files are present, well-formed, and internally consistent.

## Core Validation Rules for [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md)

The checks are implemented in the `check_docs_en_md` and `check_internal_links` functions of [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py). Each failed condition produces a machine-readable issue object containing a rule code, lesson path, file path, and diagnostic message.
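As a rough picture of that issue object, the sketch below models the four fields as a frozen dataclass. The field names follow the JSON keys shown later in this guide; the class layout itself is an assumption, and the actual definition in the script may differ.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical stand-in for the auditor's issue record."""

    rule: str     # rule code, e.g. "L004"
    lesson: str   # lesson directory, relative to the repository root
    file: str     # offending file, relative to the repository root
    message: str  # human-readable diagnostic


issue = Issue(
    rule="L004",
    lesson="phases/01-introduction/01-hello-world",
    file="phases/01-introduction/01-hello-world/docs/en.md",
    message="docs/en.md missing top-level H1",
)
record = asdict(issue)  # plain dict, ready for json.dumps
```

Keeping the record flat and string-only makes it trivial to serialize, which is presumably why the same four keys appear in the JSON report.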

### L002 — File Existence and UTF-8 Encoding

The script first verifies that [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) exists inside the lesson directory. It then attempts to read the file as UTF-8 text, wrapping the operation in a `try/except UnicodeDecodeError` block.

```python
# From scripts/audit_lessons.py (abridged)
if not doc.is_file():
    ...  # Issue L002 recorded
```

If the file is missing or cannot be decoded as UTF-8, [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) logs an **L002** issue.
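The two L002 conditions can be sketched together as follows. This is a minimal illustration, not the script's code: `read_doc`, its return shape, and the reason strings are hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_doc(lesson_dir: Path) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success, or (None, reason) for an L002 case."""
    doc = lesson_dir / "docs" / "en.md"
    if not doc.is_file():
        return None, "docs/en.md missing"
    try:
        # Decoding explicitly as UTF-8 avoids depending on the platform default.
        return doc.read_text(encoding="utf-8"), None
    except UnicodeDecodeError:
        return None, "docs/en.md is not valid UTF-8"
```

Returning the decoded text alongside the failure reason lets later checks (size, heading, links) reuse the same string without reading the file again.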

### L003 — Minimum Document Size

Every [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) must contain at least **200 bytes** of content. This threshold is defined by the constant `MIN_DOC_BYTES = 200`.

```python
# From scripts/audit_lessons.py (abridged)
if len(text.encode("utf-8")) < MIN_DOC_BYTES:
    ...  # Issue L003 recorded
```

Files that fall below this size trigger an **L003** diagnostic.
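Because the comparison is made on the UTF-8 encoding, the threshold counts bytes rather than characters. The sketch below illustrates the difference; `is_too_short` is a hypothetical helper wrapping the comparison shown above.

```python
MIN_DOC_BYTES = 200  # value quoted in this guide


def is_too_short(text: str) -> bool:
    """True when the UTF-8 encoding of text is under the threshold."""
    return len(text.encode("utf-8")) < MIN_DOC_BYTES


ascii_doc = "a" * 150          # 150 characters, 150 bytes
accented_doc = "\u00e9" * 150  # 150 characters, 300 bytes in UTF-8

assert is_too_short(ascii_doc)
assert not is_too_short(accented_doc)
```

In practice this means documents with many non-ASCII characters reach the 200-byte minimum with fewer characters than plain ASCII text.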

### L004 — Required Top-Level Heading

The Markdown must contain a level-1 heading. The script uses a compiled regular expression to confirm the presence of an H1:

```python
# From scripts/audit_lessons.py (abridged)
import re

H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)

if not H1_RE.search(text):
    ...  # Issue L004 recorded
```

Documents without a top-level heading generate an **L004** issue.
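To see what the pattern accepts and rejects, the sketch below runs it against a few inputs. The regular expression is the one quoted above; `has_h1` and the sample strings are illustrative only.

```python
import re

H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)  # pattern quoted in this guide


def has_h1(text: str) -> bool:
    """True when some line starts with '#', whitespace, then a non-space."""
    return H1_RE.search(text) is not None


FENCE = "`" * 3  # built indirectly so this example nests cleanly in Markdown

assert has_h1("# Title\n\nBody text")
assert has_h1("Intro line\n# Title")       # MULTILINE: any line may match
assert not has_h1("## Only a subsection")  # second '#' is not whitespace
assert not has_h1("#NoSpace")              # whitespace after '#' is required
assert not has_h1("  # Indented heading")  # '#' must be the first character
assert has_h1(f"{FENCE}python\n# a comment\n{FENCE}")  # code comments also match
```

The last case shows that the pattern is purely line-based: if it is applied to the raw text, as in the excerpt above, a `# comment` line inside a fenced code block satisfies it as well.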

### L010 — Internal Links Must Resolve

The `check_internal_links` function walks every match of `MD_LINK_RE` to find relative Markdown links. It converts each matched path to an absolute `Path`, then verifies the target file exists inside the repository.

```python
# Conceptual flow from scripts/audit_lessons.py
for match in MD_LINK_RE.finditer(text):
    target = resolved_path / match.group("path")
    if not target.exists():
        ...  # Issue L010 recorded
```

Broken relative links result in an **L010** failure.
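A self-contained version of that flow might look like the sketch below. Several parts are assumptions: the real `MD_LINK_RE` is not shown in this guide, so the pattern here is a hypothetical stand-in, and resolving targets relative to the document's own directory and skipping external or anchor-only links are choices made for the example.

```python
import re
from pathlib import Path
from typing import List

# Hypothetical pattern: captures the target of [label](target) as "path".
MD_LINK_RE = re.compile(r"\[[^\]]*\]\((?P<path>[^)\s]+)\)")
SKIPPED_PREFIXES = ("http://", "https://", "mailto:", "#")


def broken_links(doc: Path) -> List[str]:
    """Return relative link targets in doc that do not exist on disk."""
    text = doc.read_text(encoding="utf-8")
    missing = []
    for match in MD_LINK_RE.finditer(text):
        raw = match.group("path")
        if raw.startswith(SKIPPED_PREFIXES):
            continue  # only relative file links are checked
        # Drop any "#fragment" and resolve against the document's directory.
        target = (doc.parent / raw.split("#", 1)[0]).resolve()
        if not target.exists():
            missing.append(raw)
    return missing
```

Each entry returned by `broken_links` corresponds to one would-be **L010** diagnostic for that document.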

## Running the Auditor

Invoke the script from the repository root. By default, it scans every phase.

```bash
# Audit all phases
python scripts/audit_lessons.py
```

Restrict the scan to a single phase with the `--phase` flag:

```bash
# Audit only phase 01
python scripts/audit_lessons.py --phase 01
```

For CI pipelines or automated reporting, output the results as JSON:

```bash
# JSON-formatted report
python scripts/audit_lessons.py --json
```

## Interpreting Audit Results

When issues are detected, [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) prints a summary followed by individual diagnostics. A typical console report looks like this:

```text
audit_lessons.py — 120 lesson(s) checked, 8 issue(s)

  [L004] phases/01-introduction/01-hello-world/docs/en.md: docs/en.md missing top-level H1
  [L010] phases/02-basics/02-data-structures/docs/en.md: internal link does not resolve: "../03-models"

Summary by rule:
  L004: 4
  L010: 2
  L003: 1
  L002: 1
```

Each `Issue` contains the rule, lesson path, file path, and message. With the `--json` flag, the output is machine-readable:

```bash
$ python scripts/audit_lessons.py --json > report.json
$ jq '.issues[] | select(.rule=="L004")' report.json
{
  "rule": "L004",
  "lesson": "phases/01-introduction/01-hello-world",
  "file": "phases/01-introduction/01-hello-world/docs/en.md",
  "message": "docs/en.md missing top-level H1"
}
```
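For reporting beyond `jq`, the JSON can also be summarized in Python. The sketch below assumes the report is an object with a top-level `issues` array, as the `jq` filter above implies; `count_by_rule` is a hypothetical helper and the sample data is illustrative only.

```python
import json
from collections import Counter


def count_by_rule(report_text: str) -> Counter:
    """Tally issues per rule code from a --json report."""
    report = json.loads(report_text)
    return Counter(issue["rule"] for issue in report["issues"])


# Illustrative input; a real report would come from `--json > report.json`.
sample = json.dumps({
    "issues": [
        {"rule": "L004", "lesson": "a", "file": "a/docs/en.md", "message": "m"},
        {"rule": "L004", "lesson": "b", "file": "b/docs/en.md", "message": "m"},
        {"rule": "L010", "lesson": "c", "file": "c/docs/en.md", "message": "m"},
    ]
})
for rule, count in count_by_rule(sample).most_common():
    print(f"{rule}: {count}")
```

In a CI job, a non-empty tally could be used to fail the build or to post a per-rule summary.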

## Summary

- **[`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py)** mechanically verifies every lesson's [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) in the `rohitg00/ai-engineering-from-scratch` repository.
- The four automated checks cover **file existence and UTF-8 encoding** (**L002**), a **minimum size of 200 bytes** (**L003**), a **top-level H1 heading** (**L004**), and **resolvable internal links** (**L010**).
- The checks are implemented in the `check_docs_en_md` and `check_internal_links` functions of [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py).
- Execution supports full-repository scans, per-phase filtering with `--phase`, and JSON export with `--json` for CI integration.

## Frequently Asked Questions

### What happens if a [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file is missing?

If `docs/en.md` does not exist inside the lesson directory or cannot be read as UTF-8, the `check_docs_en_md` function records an **L002** issue. The audit continues with the remaining lessons, and the missing or undecodable file appears in the report.

### How does [`audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) validate internal links?

The `check_internal_links` function applies `MD_LINK_RE` to extract relative links from the Markdown text. It builds an absolute `Path` for each target and calls `target.exists()`. Any link that fails to resolve to an existing file inside the repository produces an **L010** error.

### Can I run the audit against a single phase only?

Yes. Use the `--phase` argument followed by the phase identifier. For example, running `python scripts/audit_lessons.py --phase 01` will scan only phase `01`, skipping all other directories in the curriculum.

### Where are the validation thresholds defined?

The thresholds are declared directly in [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py). The minimum document size is controlled by `MIN_DOC_BYTES = 200`, and the top-level heading requirement is enforced with the compiled regex `H1_RE = re.compile(r"^#\s+\S", re.MULTILINE)`.