How the AI Engineering from Scratch Curriculum Validates Internal Links in Lesson Documentation

The AI‑Engineering‑from‑Scratch curriculum validates internal links in lesson documentation by executing scripts/audit_lessons.py, which scans every lesson's docs/en.md, resolves each Markdown href against the repository filesystem, and raises rule L010 whenever a target path does not exist.

Maintaining reliable navigation across dozens of lesson files requires automated checks. The rohitg00/ai-engineering-from-scratch repository tackles this with a purpose‑built audit script that validates internal links in lesson documentation before changes ever reach learners. As implemented in the project source code, the pipeline parses Markdown cross‑references, distinguishes external URLs from internal paths, and enforces link integrity on every CI build.

The Entry Point: scripts/audit_lessons.py

The entire validation workflow is centralized in scripts/audit_lessons.py. This script is invoked both locally by contributors and automatically in GitHub Actions via the audit job defined in .github/workflows/curriculum.yml. The curriculum's operational rules, including this audit workflow, are outlined in AGENTS.md, while README.md serves as the repository entry point that references the automated CI pipeline.

Walking the Curriculum Structure with iter_lesson_dirs()

The audit begins by discovering content. The iter_lesson_dirs() function walks every phase directory matching the pattern phases/NN-.../ and yields each individual lesson folder, so every lesson that sits under a correctly named phase directory is included in the scan.
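The discovery step can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name iter_lesson_dirs_sketch, its root parameter, and the PHASE_DIR_RE pattern are all assumptions made for the example, based only on the phases/NN-.../ layout described above.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator

# Assumed naming rule for phase folders such as "14-agents"; the real script may differ.
PHASE_DIR_RE = re.compile(r"^\d{2}-")


def iter_lesson_dirs_sketch(root: Path) -> Iterator[Path]:
    """Yield each lesson folder under root/phases/NN-*/ in sorted order."""
    phases = root / "phases"
    if not phases.is_dir():
        return
    for phase in sorted(phases.iterdir()):
        # Skip files and folders that do not look like a numbered phase.
        if not (phase.is_dir() and PHASE_DIR_RE.match(phase.name)):
            continue
        for lesson in sorted(phase.iterdir()):
            if lesson.is_dir():
                yield lesson


def demo_lesson_names() -> list[str]:
    """Build a throwaway tree and list the lesson folders the walk finds."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "phases" / "01-intro" / "01-setup").mkdir(parents=True)
        (root / "phases" / "01-intro" / "02-tools").mkdir(parents=True)
        # "notes" does not match NN-..., so its contents are never yielded.
        (root / "phases" / "notes" / "scratch").mkdir(parents=True)
        return [p.name for p in iter_lesson_dirs_sketch(root)]
```

In this sketch, a folder that does not follow the numbered naming convention is silently ignored, which is worth keeping in mind when adding a new phase.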

Loading Lesson Content via check_docs_en_md()

For each discovered lesson, check_docs_en_md() loads the raw Markdown text from docs/en.md. This file serves as the canonical English‑language lesson document and is the single source of truth for link validation.
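A loading step of this kind might look like the sketch below. The helper name load_lesson_doc and its choice to return None for a missing file are assumptions for illustration; the article does not say how check_docs_en_md() handles a lesson without docs/en.md.

```python
import tempfile
from pathlib import Path
from typing import Optional


def load_lesson_doc(lesson: Path) -> Optional[str]:
    """Return the text of lesson/docs/en.md, or None when the file is absent."""
    doc = lesson / "docs" / "en.md"
    if not doc.is_file():
        return None
    return doc.read_text(encoding="utf-8")


def demo_load() -> tuple[Optional[str], Optional[str]]:
    """Load one lesson that has docs/en.md and one that does not."""
    with tempfile.TemporaryDirectory() as tmp:
        lesson = Path(tmp) / "01-setup"
        (lesson / "docs").mkdir(parents=True)
        (lesson / "docs" / "en.md").write_text("# Setup\n", encoding="utf-8")
        empty = Path(tmp) / "02-empty"
        empty.mkdir()
        return load_lesson_doc(lesson), load_lesson_doc(empty)
```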

Once the raw text is in memory, the script extracts every Markdown link destination and sets aside those that do not refer to files in the repository.

The script compiles a regular expression named MD_LINK_RE to extract the destination of each Markdown link:

MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")

The capture group holds the href portion of [text](href) syntax and stops before any # fragment identifier, so what the script checks is the file path that must exist on disk, not the anchor within it.
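To see what the pattern above does and does not capture, it can be run against a small sample. The sample text and file names (including ROADMAP.md) are made up for illustration. The results follow from the regular expression as quoted, not from any statement about how the full script treats these cases.

```python
import re

# Same pattern as quoted above.
MD_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")

sample = " ".join([
    "[the setup guide](../01-setup/docs/en.md#install)",  # fragment is dropped
    "[the roadmap](/ROADMAP.md)",                         # repo-root style path
    "[this section](#summary)",                           # fragment only: no match
    '[a titled link](notes.md "Notes")',                  # link title: no match
    "![diagram](assets/flow.png)",                        # image syntax also matches
])

hrefs = [m.group(1) for m in MD_LINK_RE.finditer(sample)]
print(hrefs)
```

Two consequences are visible here: links that consist only of a fragment, and links that carry a quoted title, are not matched by this pattern, so they would not reach the existence check. Image references, on the other hand, are matched because the pattern does not exclude a leading exclamation mark.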

Inside check_internal_links(), whose signature is def check_internal_links(audit: Audit, lesson: Path, text: str) -> None, the script first deduplicates hrefs using a seen set, then discards any link that begins with http://, https://, mailto:, or data:. The first three point to resources outside the repository and data: URIs carry their content inline, so none of them can be validated against the filesystem.
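The deduplication and scheme filter can be isolated into a small helper to show their effect. The helper name internal_hrefs and the constant EXTERNAL_PREFIXES are hypothetical; only the four prefixes and the seen-set logic come from the description above.

```python
# The four prefixes named in the article; the constant name is an assumption.
EXTERNAL_PREFIXES = ("http://", "https://", "mailto:", "data:")


def internal_hrefs(hrefs: list[str]) -> list[str]:
    """Drop duplicates and skipped schemes, preserving first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for href in hrefs:
        href = href.strip()
        if href in seen:
            continue
        seen.add(href)
        # str.startswith accepts a tuple and is case-sensitive.
        if href.startswith(EXTERNAL_PREFIXES):
            continue
        kept.append(href)
    return kept


result = internal_hrefs([
    "https://example.com",
    "../a.md",
    "../a.md",                 # duplicate, dropped
    "mailto:team@example.com",
    "/README.md",
    "ftp://example.com/file",  # not in the skip list, so treated as internal
])
print(result)
```

Because the filter is a prefix list, any scheme outside those four (ftp:// in this sample) falls through to path resolution and would be reported as a missing file.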

Resolving Targets and Enforcing Existence

After filtering, the remaining hrefs are mapped to concrete filesystem paths using repository‑specific resolution rules.

Absolute vs. Relative Path Resolution

The script handles two path styles:

  • Absolute repo‑root paths — If the href starts with /, it is resolved relative to ROOT via ROOT / href.lstrip("/").
  • Relative paths — Otherwise, the path is treated as relative to the docs folder using (doc.parent / href).resolve().

This dual‑resolution strategy allows curriculum authors to link from docs/en.md to sibling assets or to reference top‑level curriculum files from anywhere in the repository.
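The two resolution rules can be exercised against a temporary directory tree. In this sketch resolve_href is a hypothetical helper that takes the repository root as a parameter instead of using a module-level ROOT, and the file names are invented for the demonstration.

```python
import tempfile
from pathlib import Path


def resolve_href(root: Path, doc: Path, href: str) -> Path:
    """Map an href to a filesystem path using the two rules described above."""
    if href.startswith("/"):
        # Repo-root style: strip the leading slash and join onto the root.
        return root / href.lstrip("/")
    # Otherwise resolve relative to the folder that contains the document.
    return (doc.parent / href).resolve()


def demo_resolution() -> dict[str, bool]:
    """Report which sample hrefs resolve to an existing path."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        doc = root / "phases" / "01-intro" / "01-setup" / "docs" / "en.md"
        doc.parent.mkdir(parents=True)
        doc.write_text("# Setup\n", encoding="utf-8")
        (root / "README.md").write_text("# Repo\n", encoding="utf-8")
        (doc.parent / "diagram.png").write_bytes(b"")
        hrefs = [
            "/README.md",             # repo-root path
            "diagram.png",            # sibling of en.md
            "../../../../README.md",  # four levels up from docs/
            "../code/main.py",        # never created, so it is missing
        ]
        return {h: resolve_href(root, doc, h).exists() for h in hrefs}
```

Note that a relative href is interpreted from the docs folder, not from the lesson folder, so reaching a file elsewhere in the lesson takes one leading "../".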

The full check_internal_links() implementation ties extraction, filtering, and resolution together:

def check_internal_links(audit: Audit, lesson: Path, text: str) -> None:
    """Validate that every internal Markdown link resolves to an existing file."""
    doc = lesson / "docs" / "en.md"
    seen: set[str] = set()
    for match in MD_LINK_RE.finditer(text):
        href = match.group(1).strip()
        if href in seen:
            continue
        seen.add(href)

        # Skip external schemes
        if href.startswith(("http://", "https://", "mailto:", "data:")):
            continue

        # Resolve absolute repo-root paths or relative paths
        if href.startswith("/"):
            target = ROOT / href.lstrip("/")
        else:
            target = (doc.parent / href).resolve()

        # Report a broken internal link
        if not target.exists():
            audit.add(
                "L010",
                lesson,
                doc,
                f"internal link does not resolve: {href!r}"
            )

The final gate is a simple existence check. The script calls target.exists(), and if the resolved path is missing, it records an Issue with rule L010: "internal link does not resolve". This message is emitted alongside the broken href and source file path, giving contributors an exact pointer to the problem.
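The article does not show how Audit and Issue are defined, so the following is only a guess at a compatible shape. The class names SketchIssue and SketchAudit, the field names, and the format_lines() output layout are all assumptions; the only details taken from the source are the argument order of audit.add() and the L010 message text.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchIssue:
    """Hypothetical record for one finding."""
    rule: str
    lesson: Path
    path: Path
    message: str


@dataclass
class SketchAudit:
    """Hypothetical collector exposing the add() call used in the source."""
    issues: list[SketchIssue] = field(default_factory=list)

    def add(self, rule: str, lesson: Path, path: Path, message: str) -> None:
        self.issues.append(SketchIssue(rule, lesson, path, message))

    def format_lines(self) -> list[str]:
        # Assumed layout: rule id, source file, then the message.
        return [f"{i.rule} {i.path.as_posix()}: {i.message}" for i in self.issues]


lesson = Path("phases/01-intro/01-setup")
doc = lesson / "docs" / "en.md"
href = "../missing.md"

audit = SketchAudit()
audit.add("L010", lesson, doc, f"internal link does not resolve: {href!r}")
lines = audit.format_lines()
print(lines[0])
```

The {href!r} conversion in the message wraps the href in quotes, which makes stray whitespace or unusual characters in a broken link easy to spot in the report.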

Running the Audit Locally and in CI

Command‑Line Usage

You can execute the script manually to validate links before pushing:

python3 scripts/audit_lessons.py          # checks all lessons

python3 scripts/audit_lessons.py --phase 14   # only phase 14

python3 scripts/audit_lessons.py --json        # machine-readable output
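A command-line interface with these two flags could be declared with argparse along the following lines. Only the flag names --phase and --json come from the commands above; the integer type for --phase, the defaults, and the help strings are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two flags shown in the usage examples (types are assumed)."""
    parser = argparse.ArgumentParser(description="Audit curriculum lessons.")
    parser.add_argument(
        "--phase",
        type=int,
        default=None,
        help="limit the audit to one phase number (assumed to be an integer)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="emit machine-readable output instead of text",
    )
    return parser


args = build_parser().parse_args(["--phase", "14", "--json"])
defaults = build_parser().parse_args([])
print(args.phase, args.json)
```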

GitHub Actions Enforcement

According to the curriculum source code, the audit job inside .github/workflows/curriculum.yml runs this script on every build, so a dead internal link is flagged in CI before the change is merged into the main branch.

Summary

  • The rohitg00/ai-engineering-from-scratch curriculum uses scripts/audit_lessons.py to validate internal links in lesson documentation.
  • iter_lesson_dirs() and check_docs_en_md() discover and load every docs/en.md file.
  • MD_LINK_RE extracts Markdown hrefs, while check_internal_links() filters external schemes and resolves paths.
  • Absolute paths are mapped to ROOT; relative paths are resolved against the docs parent directory.
  • Missing targets trigger L010: "internal link does not resolve".
  • The audit runs locally via CLI and automatically in CI through .github/workflows/curriculum.yml.

Frequently Asked Questions

Where is the link validation logic implemented?

The core logic lives in scripts/audit_lessons.py inside the rohitg00/ai-engineering-from-scratch repository. Lines 96–112 contain the href extraction, path resolution, and existence check that together define the link validation behavior.

Which links does the audit skip?

Inside check_internal_links(), the script explicitly skips any href that starts with http://, https://, mailto:, or data:. Everything else is treated as an internal filesystem reference and resolved against the repository.

What happens when an internal link is broken?

When target.exists() returns False, the audit records an issue under rule L010 with the message "internal link does not resolve". This issue includes the exact href and the source file path, appearing in both human-readable and JSON report formats.

Can the audit be limited to a single phase?

Yes. Pass the --phase flag followed by the phase number. For example, python3 scripts/audit_lessons.py --phase 14 limits the audit to phase 14 only, which is useful for targeted debugging during content updates.
