
Artifact Discovery and Frontmatter Parsing in install_skills.py: A Complete Guide

June 9, 2026 rohitg00/ai-engineering-from-scratch

The install_skills.py script automates curriculum installation by scanning phases/*/[0-9][0-9]-*/outputs/ for Markdown files, extracting their YAML frontmatter with parse_frontmatter(), and converting each file into a typed Artifact dataclass, all without external dependencies.

The rohitg00/ai-engineering-from-scratch repository provides a lightweight toolchain for packaging AI engineering educational content. At its core, the artifact discovery and frontmatter parsing pipeline transforms Markdown-based skills, prompts, and agents into installable components. This article examines how scripts/install_skills.py locates curriculum files and parses their metadata using only Python standard library utilities.

Artifact Discovery Mechanism

The discovery phase is implemented by discover_artifacts() in scripts/install_skills.py (lines 91-137). This function walks the repository's phases/ directory structure to identify candidate Markdown files for installation.

Directory Traversal and File Filtering

The function targets the specific glob pattern */[0-9][0-9]-*/outputs under the PHASES_DIR constant (the repository root's phases folder). It filters for .md files exclusively, ignoring all other extensions.

def discover_artifacts() -> Iterable[Artifact]:
    if not PHASES_DIR.is_dir():
        return
    for output_dir in sorted(PHASES_DIR.glob("*/[0-9][0-9]-*/outputs")):
        for path in sorted(output_dir.iterdir()):
            if path.suffix != ".md" or not path.is_file():
                continue
            # Type detection and frontmatter parsing follow...
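
To see which directories that glob pattern actually selects, here is a minimal sketch that runs the same pattern against a throwaway directory tree. The folder names and the matching_output_dirs() helper are invented for illustration and are not part of the repository.

```python
import tempfile
from pathlib import Path


def matching_output_dirs(phases_dir: Path) -> list[str]:
    """Return the outputs directories selected by the pattern, relative to phases_dir."""
    return [
        p.relative_to(phases_dir).as_posix()
        for p in sorted(phases_dir.glob("*/[0-9][0-9]-*/outputs"))
    ]


with tempfile.TemporaryDirectory() as tmp:
    phases = Path(tmp) / "phases"
    # Hypothetical layout, invented for this example only.
    for rel in (
        "01-foundations/01-intro/outputs",       # matches
        "01-foundations/02-setup/outputs",       # matches
        "01-foundations/notes/outputs",          # lesson folder lacks the NN- prefix
        "01-foundations/03-deep/extra/outputs",  # outputs is one level too deep
    ):
        (phases / rel).mkdir(parents=True)
    matched = matching_output_dirs(phases)
    print(matched)
```

Only the first two directories are returned: the pattern requires exactly one phase folder, then a lesson folder that starts with two digits and a hyphen, then outputs. A recursive pattern such as phases/**/outputs would also pick up the other two.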

Filename-Based Type Detection

After locating a Markdown file, the script determines the artifact category by inspecting the filename prefix. Valid prefixes include:

skill- → type skill
prompt- → type prompt
agent- → type agent

If the frontmatter lacks explicit phase or lesson fields, the script calls derive_phase_lesson() to extract default values from the directory path components.

Artifact Initialization

Each discovered file becomes an Artifact dataclass instance containing the fields: type, name, phase, lesson, version, description, tags, and source. The file content is read and passed to parse_frontmatter() for metadata extraction.
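
As a rough picture of that dataclass, the sketch below declares the eight fields in the order listed. The field types, the ArtifactSketch name, the artifact_from_meta() helper, and the fallback values are all assumptions; the real definition in install_skills.py may differ.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ArtifactSketch:
    # Field names follow the article; the types are assumed.
    type: str
    name: str
    phase: str
    lesson: str
    version: str
    description: str
    tags: list[str]
    source: Path


def artifact_from_meta(
    kind: str, meta: dict[str, object], source: Path, phase: str, lesson: str
) -> ArtifactSketch:
    """Build an ArtifactSketch from a frontmatter dict (hypothetical helper)."""
    tags = meta.get("tags") or []
    return ArtifactSketch(
        type=kind,
        name=str(meta.get("name", source.stem)),  # assumed fallback: file stem
        phase=str(meta.get("phase", phase)),      # path-derived default
        lesson=str(meta.get("lesson", lesson)),   # path-derived default
        version=str(meta.get("version", "")),
        description=str(meta.get("description", "")),
        tags=list(tags) if isinstance(tags, list) else [str(tags)],
        source=source,
    )
```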

Frontmatter Parsing Implementation

The YAML parsing logic lives in scripts/_lib.py (lines 12-60) within the parse_frontmatter() function. This implementation operates without third-party libraries like PyYAML, making the installation script dependency-free.

The parse_frontmatter() Function

The parser expects a YAML frontmatter block delimited by --- on its own line at the start of the file. It extracts the content between the opening and closing delimiters and processes it line-by-line.

def parse_frontmatter(text: str) -> dict[str, object] | None:
    """Parse a YAML-subset front-matter block at the top of a markdown string."""
    if not text.startswith("---\n"):
        return None
    end = text.find("\n---\n", 4)
    if end == -1 and text.endswith("\n---"):
        end = len(text) - 4
    if end == -1:
        return None
    block = text[4:end].strip("\n")
    result: dict[str, object] = {}
    for raw in block.splitlines():
        if not raw or raw.startswith("#") or raw[0] in (" ", "\t"):
            continue
        if ":" not in raw:
            continue
        key, _, value = raw.partition(":")
        key = key.strip()
        value = value.strip()
        # List and string parsing logic...

    return result

Supported YAML Constructs

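The parser handles only the small subset of YAML syntax that curriculum metadata needs. As the excerpt above shows, indented lines and lines without a colon are skipped. The supported constructs are: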

Bare strings: description: Some text here
Quoted strings: name: "Advanced RAG" or version: '1.0'
Inline lists: tags: [python, rag, "advanced topics"]
Comments: Lines starting with # are ignored
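
The value-handling step is elided in the excerpt, so the following is only a sketch of how the listed constructs could be turned into Python values. Both unquote() and parse_value_sketch() are hypothetical names, not functions from scripts/_lib.py.

```python
def unquote(s: str) -> str:
    """Strip one pair of matching single or double quotes, if present."""
    s = s.strip()
    if len(s) >= 2 and s[0] == s[-1] and s[0] in ("'", '"'):
        return s[1:-1]
    return s


def parse_value_sketch(value: str) -> object:
    """Turn the text after 'key:' into a string or a list of strings."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        inner = value[1:-1].strip()
        if not inner:
            return []
        # Naive split: a comma inside a quoted item would be split too.
        return [unquote(item) for item in inner.split(",")]
    return unquote(value)


print(parse_value_sketch('[python, rag, "advanced topics"]'))
print(parse_value_sketch("'1.0'"))
```

Splitting on every comma keeps the sketch short but mishandles items such as "a, b" inside quotes; a real parser needs to track quote state while scanning the list.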

Return Behavior

The function returns a dictionary mapping keys to parsed values, or None if the file lacks valid frontmatter delimiters. This dictionary is then used to populate the Artifact dataclass fields during the discovery phase.

Integration Pipeline

The two mechanisms operate sequentially to build the installation manifest:

Discovery: discover_artifacts() locates the .md files in each phases/*/[0-9][0-9]-*/outputs/ directory
Parsing: Each file's text is passed to parse_frontmatter() to extract metadata fields like name, description, version, and tags
Enrichment: Missing phase or lesson values are derived from the file path using derive_phase_lesson()
Installation: The populated Artifact objects are filtered by type and copied to the target directory, with entries recorded in manifest.json
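
The final filter-and-copy step could be sketched as follows. Everything here is an assumption made for illustration: install_sketch() is a hypothetical function, artifacts are plain dictionaries, and the manifest.json layout is invented because the article does not show the real schema.

```python
import json
import shutil
from pathlib import Path


def install_sketch(
    artifacts: list[dict], target: Path, wanted_type: str, dry_run: bool = False
) -> list[dict]:
    """Filter artifacts by type, copy their files, and write a manifest."""
    selected = [a for a in artifacts if a["type"] == wanted_type]
    if dry_run:
        return selected  # report what would be installed, touch nothing
    target.mkdir(parents=True, exist_ok=True)
    for a in selected:
        shutil.copy2(a["source"], target / Path(a["source"]).name)
    # Invented manifest layout: one entry per installed file.
    manifest = [
        {"type": a["type"], "name": a["name"], "file": Path(a["source"]).name}
        for a in selected
    ]
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return selected
```

Returning the selection before any filesystem work keeps the dry-run path free of side effects, which mirrors the purpose of a preview mode.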

Working with the Installation Tool

Listing Skill Artifacts (Dry Run)

Execute the following command to preview which skills would be installed without copying files:

python3 scripts/install_skills.py /tmp/output --type skill --dry-run

This invokes the discovery and parsing pipeline, printing a summary of matched artifacts and their metadata.

Direct Frontmatter Access

You can import the parser independently to inspect curriculum metadata:

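from pathlib import Path
from scripts._lib import parse_frontmatter  # run from the repository root

# Example path; substitute any Markdown file under an outputs/ directory.
md_path = Path("phases/01-foundations/01-intro/outputs/skill-intro.md")
meta = parse_frontmatter(md_path.read_text(encoding="utf-8"))

if meta is None:
    # parse_frontmatter() returns None when the delimiters are missing.
    print("no frontmatter found")
else:
    print(meta.get("name"))      # e.g. Intro Skill
    print(meta.get("tags", []))  # e.g. ['foundation', 'basics']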

Summary

discover_artifacts() in scripts/install_skills.py walks the phases/ directory tree, filtering for Markdown files with specific filename prefixes (skill-, prompt-, agent-)
parse_frontmatter() in scripts/_lib.py provides a lightweight, zero-dependency parser for a YAML subset covering strings, inline lists, and comments
The Artifact dataclass standardizes metadata fields including type, version, tags, and lesson associations
Path-derived defaults ensure robust operation even when frontmatter lacks explicit phase or lesson declarations
The entire pipeline operates without external Python packages, relying solely on the standard library

Frequently Asked Questions

What file pattern does discover_artifacts() search for?

The function searches for directories matching phases/*/[0-9][0-9]-*/outputs and considers only files ending in .md. This structure corresponds to the curriculum's phase-lesson organization, where two-digit prefixes indicate ordering.

Which YAML constructs does parse_frontmatter() support?

The parser supports bare strings, single and double-quoted strings, comma-separated inline lists enclosed in brackets, and comment lines beginning with #. It does not support nested dictionaries, multi-line strings, or complex YAML anchors.

How does the script handle missing metadata fields?

If the frontmatter dictionary lacks phase or lesson keys, install_skills.py invokes derive_phase_lesson() to infer these values from the file's directory path. This ensures every Artifact instance carries phase and lesson values even when the frontmatter omits them.

What types of artifacts can be discovered?

The discovery mechanism recognizes three artifact types based on filename prefixes: skills (files starting with skill-), prompts (files starting with prompt-), and agents (files starting with agent-). Each type is handled identically during parsing but may be filtered separately during installation.
