# How the AI Engineering from Scratch Curriculum Validates Internal Links in Lesson Documentation

> Learn how the AI Engineering from Scratch curriculum automatically validates internal links in lesson documentation using a Python script to ensure all cross-references are accurate and functional.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-09

---

**The AI‑Engineering‑from‑Scratch curriculum validates internal links in lesson documentation by executing [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py), which scans every lesson's [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md), resolves each Markdown href against the repository filesystem, and raises rule `L010` whenever a target path does not exist.**

Maintaining reliable navigation across dozens of lesson files requires automated checks. The `rohitg00/ai-engineering-from-scratch` repository tackles this with a purpose‑built audit script that validates internal links in lesson documentation before changes ever reach learners. As implemented in the project source code, the pipeline parses Markdown cross‑references, distinguishes external URLs from internal paths, and enforces link integrity on every CI build.

## The Entry Point: [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py)

The entire validation workflow is centralized in [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py). This script is invoked both locally by contributors and automatically in GitHub Actions via the `audit` job defined in [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml). The curriculum's operational rules, including this audit workflow, are outlined in [`AGENTS.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/AGENTS.md), while [`README.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/README.md) serves as the repository entry point that references the automated CI pipeline.

### Walking the Curriculum Structure with `iter_lesson_dirs()`

The audit begins by discovering content. The `iter_lesson_dirs()` function walks every phase directory matching the pattern `phases/NN-.../` and yields each individual lesson folder. Because discovery is driven by the directory layout rather than a hand-maintained list, every lesson folder under a matching phase directory is included in the scan.
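To make the traversal concrete, here is a minimal sketch of what such a walker could look like. It is not the repository's implementation: the name `iter_lesson_dirs_sketch`, the `PHASE_DIR_RE` pattern, and the two-digit prefix rule are assumptions inferred from the `phases/NN-.../` pattern described above, and the folder names in the demo are invented.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator

# Assumption: phase folders carry a two-digit prefix such as "01-setup".
PHASE_DIR_RE = re.compile(r"^\d{2}-")


def iter_lesson_dirs_sketch(root: Path) -> Iterator[Path]:
    """Yield every lesson folder under root/phases/NN-.../ in sorted order."""
    phases = root / "phases"
    if not phases.is_dir():
        return
    for phase in sorted(phases.iterdir()):
        # Skip files and folders that lack the NN- prefix.
        if not (phase.is_dir() and PHASE_DIR_RE.match(phase.name)):
            continue
        for lesson in sorted(phase.iterdir()):
            if lesson.is_dir():
                yield lesson


def demo_lesson_walk() -> list[str]:
    """Build a throwaway tree and list the lesson folders the walker finds."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "phases" / "01-setup" / "01-intro").mkdir(parents=True)
        (root / "phases" / "02-math" / "01-vectors").mkdir(parents=True)
        (root / "phases" / "notes").mkdir()  # no NN- prefix, so it is skipped
        return [p.relative_to(root).as_posix() for p in iter_lesson_dirs_sketch(root)]


print(demo_lesson_walk())
```

Sorting the directory entries keeps the scan order stable across platforms, which makes audit reports easier to compare between runs.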

### Loading Lesson Content via `check_docs_en_md()`

For each discovered lesson, `check_docs_en_md()` loads the raw Markdown text from [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md). This file serves as the canonical English‑language lesson document, and its text is what the link check scans.

## Parsing and Filtering Markdown Links

Once the raw text is in memory, the script extracts every Markdown link and filters out those that point to external resources.

### Extracting Hrefs with `MD_LINK_RE`

The script compiles a regular expression named `MD_LINK_RE` to extract the destination of each Markdown link:

```python
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")
```

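This pattern captures the href portion of `[text](href)` syntax and stops at the first `#`, so any fragment identifier is excluded from the captured path. Because the capture group requires at least one non-`#` character, fragment-only links such as `[top](#overview)` are not matched at all. What remains is the file path that must exist on disk.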
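The behavior of the pattern is easiest to see on a small input. The snippet below reuses the regular expression exactly as quoted above; the sample Markdown, the file names in it, and the helper `extract_hrefs` are invented for illustration.

```python
import re

# Pattern copied from the article; everything else here is illustrative.
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")

SAMPLE = (
    "See [setup](../01-setup/docs/en.md#install), "
    "the [outline](/ROADMAP.md), "
    "[top](#overview), and "
    "[PyTorch](https://pytorch.org)."
)


def extract_hrefs(text: str) -> list[str]:
    """Return the captured href of every matched Markdown link, fragment removed."""
    return [match.group(1) for match in MD_LINK_RE.finditer(text)]


hrefs = extract_hrefs(SAMPLE)
print(hrefs)
```

The fragment `#install` is dropped from the first link, the fragment-only link is skipped entirely, and the external URL is still captured at this stage because scheme filtering happens later.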

### Ignoring External Schemes in `check_internal_links()`

Inside `check_internal_links()`, whose signature is `def check_internal_links(audit: Audit, lesson: Path, text: str) -> None`, the script first deduplicates hrefs using a `seen` set, then discards any link that begins with `http://`, `https://`, `mailto:`, or `data:`. These schemes point to external resources and are therefore outside the scope of internal filesystem validation.
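The deduplicate-then-filter order can be isolated in a few lines. This is a sketch of that step only: the helper name `internal_hrefs`, the constant `EXTERNAL_PREFIXES`, and the sample hrefs are assumptions made for the example, while the four prefixes come from the description above.

```python
EXTERNAL_PREFIXES = ("http://", "https://", "mailto:", "data:")


def internal_hrefs(hrefs: list[str]) -> list[str]:
    """Drop repeated hrefs first, then drop those with an external scheme."""
    seen: set[str] = set()
    kept: list[str] = []
    for href in hrefs:
        href = href.strip()
        if href in seen:
            continue
        seen.add(href)
        # str.startswith accepts a tuple and matches any of its members.
        if href.startswith(EXTERNAL_PREFIXES):
            continue
        kept.append(href)
    return kept


sample = [
    "../02-tensors/docs/en.md",
    "https://example.com/paper",
    "../02-tensors/docs/en.md",  # repeat, skipped by the seen set
    "mailto:team@example.com",
    "/README.md",
]
print(internal_hrefs(sample))
```

Note that the check is a prefix match against a fixed list, so an href with a scheme that is not listed (for example `ftp://`) would fall through and be treated as a filesystem path.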

## Resolving Targets and Enforcing Existence

After filtering, the remaining hrefs are mapped to concrete filesystem paths using repository‑specific resolution rules.

### Absolute vs. Relative Path Resolution

The script handles two path styles:

- **Absolute repo‑root paths**: If the href starts with `/`, it is resolved relative to `ROOT` via `ROOT / href.lstrip("/")`.
- **Relative paths**: Otherwise, the href is resolved against the directory that contains `en.md` (the lesson's `docs` folder) using `(doc.parent / href).resolve()`.

This dual‑resolution strategy allows curriculum authors to link from [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) to sibling assets or to reference top‑level curriculum files from anywhere in the repository.
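The two resolution rules can be tried in isolation with `pathlib`. In the sketch below, `resolve_target` and `demo_resolution` are hypothetical helpers, the root is a temporary directory standing in for `ROOT`, and the lesson layout and hrefs are invented; only the two path expressions are taken from the description above.

```python
import tempfile
from pathlib import Path


def resolve_target(root: Path, doc: Path, href: str) -> Path:
    """Map an internal href to a filesystem path using the two rules above."""
    if href.startswith("/"):
        # Repo-root style: strip the leading slash and join onto the root.
        return root / href.lstrip("/")
    # Relative style: resolve against the folder that holds en.md.
    return (doc.parent / href).resolve()


def demo_resolution() -> dict[str, str]:
    """Resolve two sample hrefs and report each result relative to the root."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()  # stand-in for ROOT
        doc = root / "phases" / "01-setup" / "01-intro" / "docs" / "en.md"
        samples = ["/README.md", "../code/main.py"]
        return {
            href: resolve_target(root, doc, href).relative_to(root).as_posix()
            for href in samples
        }


print(demo_resolution())
```

Here `../code/main.py` climbs out of `docs` into the lesson folder, which is how a lesson document would reach a sibling asset, while `/README.md` lands at the top of the repository regardless of where the lesson lives.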

### Raising Rule `L010` for Broken Links

The full `check_internal_links()` implementation ties extraction, filtering, and resolution together:

```python
def check_internal_links(audit: Audit, lesson: Path, text: str) -> None:
    """Validate that every internal Markdown link resolves to an existing file."""
    doc = lesson / "docs" / "en.md"
    seen: set[str] = set()
    for match in MD_LINK_RE.finditer(text):
        href = match.group(1).strip()
        if href in seen:
            continue
        seen.add(href)

        # Skip external schemes
        if href.startswith(("http://", "https://", "mailto:", "data:")):
            continue

        # Resolve absolute repo-root paths or relative paths
        if href.startswith("/"):
            target = ROOT / href.lstrip("/")
        else:
            target = (doc.parent / href).resolve()

        # Report a broken internal link
        if not target.exists():
            audit.add(
                "L010",
                lesson,
                doc,
                f"internal link does not resolve: {href!r}"
            )
```

The final gate is a simple existence check. The script calls `target.exists()`, and if the resolved path is missing, it records an `Issue` with rule **L010**: "internal link does not resolve". This message is emitted alongside the broken href and source file path, giving contributors an exact pointer to the problem.
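To see an `L010` report end to end, the check can be exercised against a throwaway lesson folder. Everything around the core logic here is an assumption: the `Issue` and `Audit` classes are hypothetical stand-ins (the article does not show their real fields), `check_links` is a condensed variant that takes the root as a parameter instead of reading a module-level `ROOT`, and the lesson files are invented.

```python
import re
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")


@dataclass
class Issue:
    """Hypothetical record type; the real Issue fields may differ."""
    rule: str
    lesson: Path
    path: Path
    message: str


@dataclass
class Audit:
    """Hypothetical collector exposing an add() call with four arguments."""
    issues: list[Issue] = field(default_factory=list)

    def add(self, rule: str, lesson: Path, path: Path, message: str) -> None:
        self.issues.append(Issue(rule, lesson, path, message))


def check_links(audit: Audit, root: Path, lesson: Path, text: str) -> None:
    """Condensed variant of the link check with the root passed explicitly."""
    doc = lesson / "docs" / "en.md"
    seen: set[str] = set()
    for match in MD_LINK_RE.finditer(text):
        href = match.group(1).strip()
        if href in seen:
            continue
        seen.add(href)
        if href.startswith(("http://", "https://", "mailto:", "data:")):
            continue
        if href.startswith("/"):
            target = root / href.lstrip("/")
        else:
            target = (doc.parent / href).resolve()
        if not target.exists():
            audit.add("L010", lesson, doc, f"internal link does not resolve: {href!r}")


def demo_audit() -> list[str]:
    """Create one lesson with one good and one broken link, then audit it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        lesson = root / "phases" / "01-setup" / "01-intro"
        (lesson / "docs").mkdir(parents=True)
        (lesson / "code").mkdir()
        (lesson / "code" / "main.py").write_text("print('hi')\n", encoding="utf-8")
        text = "[code](../code/main.py) [gone](../code/missing.py) [site](https://example.com)"
        audit = Audit()
        check_links(audit, root, lesson, text)
        return [f"{issue.rule}: {issue.message}" for issue in audit.issues]


print(demo_audit())
```

Only the link to `missing.py` produces an issue: the existing file passes the existence check and the external URL is skipped before any path is built.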

## Running the Audit Locally and in CI

### Command‑Line Usage

You can execute the script manually to validate links before pushing:

```bash
python3 scripts/audit_lessons.py              # checks all lessons
python3 scripts/audit_lessons.py --phase 14   # only phase 14
python3 scripts/audit_lessons.py --json       # machine-readable output
```
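For readers wondering how flags like these are typically wired up, the following is a hypothetical `argparse` declaration that accepts the same two options. It is not taken from `audit_lessons.py`: the function name `build_parser`, the help strings, and the choice to parse `--phase` as an integer are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two flags shown above."""
    parser = argparse.ArgumentParser(description="Audit curriculum lessons")
    # Assumption: the phase is parsed as an integer; the real script may keep a string.
    parser.add_argument("--phase", type=int, default=None,
                        help="limit the audit to one phase number")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--phase", "14", "--json"])
print(args.phase, args.json)
```

With no arguments, `args.phase` stays `None` and `args.json` stays `False`, which corresponds to the default run over all lessons with human-readable output.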

### GitHub Actions Enforcement

According to the curriculum source code, the `audit` job inside [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml) runs this script on every build. This surfaces dead internal links before they are merged into the main branch, enforcing documentation quality at the CI layer.

## Summary

- The `rohitg00/ai-engineering-from-scratch` curriculum uses **[`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py)** to validate internal links in lesson documentation.
- **`iter_lesson_dirs()`** and **`check_docs_en_md()`** discover and load every [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file.
- **`MD_LINK_RE`** extracts Markdown hrefs, while **`check_internal_links()`** filters external schemes and resolves paths.
- Absolute paths are mapped to `ROOT`; relative paths are resolved against the lesson's `docs` directory (`doc.parent`).
- Missing targets trigger **`L010`**: "internal link does not resolve".
- The audit runs locally via CLI and automatically in CI through [`.github/workflows/curriculum.yml`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.github/workflows/curriculum.yml).

## Frequently Asked Questions

### What file contains the core logic for validating internal links in lesson documentation?

The core logic lives in [`scripts/audit_lessons.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/audit_lessons.py) inside the `rohitg00/ai-engineering-from-scratch` repository. Lines 96–112 contain the href extraction, path resolution, and existence check that together define the link validation behavior.

### How does the audit distinguish between internal and external links?

Inside `check_internal_links()`, the script explicitly skips any href that starts with `http://`, `https://`, `mailto:`, or `data:`. Everything else is treated as an internal filesystem reference and resolved against the repository.

### What happens when an internal link points to a missing file?

When `target.exists()` returns `False`, the audit records an issue under rule `L010` with the message "internal link does not resolve". This issue includes the exact href and the source file path, appearing in both human‑readable and JSON report formats.

### Can I run the link check on a single phase instead of the entire curriculum?

Yes. Pass the `--phase` flag followed by the phase number. For example, `python3 scripts/audit_lessons.py --phase 14` limits the audit to phase 14 only, which is useful for targeted debugging during content updates.