# Artifact Discovery and Frontmatter Parsing in install_skills.py: A Complete Guide

> Learn artifact discovery and frontmatter parsing in install_skills.py. This guide explains how the script extracts YAML frontmatter from Markdown files to create typed Artifacts without dependencies.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-09

---

**The [`scripts/install_skills.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/install_skills.py) script automates curriculum installation by scanning directories matching `phases/*/[0-9][0-9]-*/outputs` for Markdown files, extracting their YAML frontmatter via `parse_frontmatter()`, and converting each into a typed `Artifact` dataclass using only the Python standard library.**

The `rohitg00/ai-engineering-from-scratch` repository provides a lightweight toolchain for packaging AI engineering educational content. At its core, the **artifact discovery and frontmatter parsing** pipeline transforms Markdown-based skills, prompts, and agents into installable components. This article examines how [`scripts/install_skills.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/install_skills.py) locates curriculum files and parses their metadata using only Python standard library utilities.

## Artifact Discovery Mechanism

The discovery phase is implemented by `discover_artifacts()` in [`scripts/install_skills.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/install_skills.py) (lines 91-137). This function walks the repository's `phases/` directory structure to identify candidate Markdown files for installation.

### Directory Traversal and File Filtering

The function targets the specific glob pattern `*/[0-9][0-9]-*/outputs` under the `PHASES_DIR` constant (the repository root's `phases` folder). It filters for `.md` files exclusively, ignoring all other extensions.

```python
def discover_artifacts() -> Iterable[Artifact]:
    if not PHASES_DIR.is_dir():
        return
    for output_dir in sorted(PHASES_DIR.glob("*/[0-9][0-9]-*/outputs")):
        for path in sorted(output_dir.iterdir()):
            if path.suffix != ".md" or not path.is_file():
                continue
            # Type detection and frontmatter parsing follow...
```
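To see what this pattern does and does not match, the sketch below rebuilds the same traversal against a throwaway directory tree. The helper name `find_output_markdown` and every directory and file name are made up for illustration; only the glob pattern and the `.md` filter come from the excerpt above.

```python
import tempfile
from pathlib import Path


def find_output_markdown(phases_dir: Path) -> list[Path]:
    """Collect .md files from <phase>/<NN-lesson>/outputs, in sorted order."""
    found = []
    for output_dir in sorted(phases_dir.glob("*/[0-9][0-9]-*/outputs")):
        for path in sorted(output_dir.iterdir()):
            if path.suffix == ".md" and path.is_file():
                found.append(path)
    return found


# Hypothetical layout: one lesson directory that matches, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    phases = Path(tmp) / "phases"
    good = phases / "01-foundations" / "02-intro" / "outputs"
    bad = phases / "01-foundations" / "notes" / "outputs"  # no two-digit prefix
    good.mkdir(parents=True)
    bad.mkdir(parents=True)
    (good / "skill-intro.md").write_text("---\nname: intro\n---\n")
    (good / "diagram.png").write_text("")       # wrong extension, skipped
    (bad / "skill-ignored.md").write_text("")   # directory name fails the glob
    matched = [p.relative_to(phases).as_posix() for p in find_output_markdown(phases)]

print(matched)  # ['01-foundations/02-intro/outputs/skill-intro.md']
```

Only the file inside a lesson directory with a two-digit prefix is returned; the `.png` file and the Markdown file under `notes/` are both skipped.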

### Filename-Based Type Detection

After locating a Markdown file, the script determines the artifact category by inspecting the filename prefix. Valid prefixes include:
- **`skill-`** → type `skill`
- **`prompt-`** → type `prompt`
- **`agent-`** → type `agent`
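
The prefix mapping above can be sketched as a small lookup. `TYPE_PREFIXES` and `detect_type` are hypothetical names, not taken from the script, and how the script treats a file with no recognized prefix is not shown in this guide, so the sketch simply returns `None` in that case.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping, mirroring the three prefixes listed above.
TYPE_PREFIXES = {"skill-": "skill", "prompt-": "prompt", "agent-": "agent"}


def detect_type(path: Path) -> Optional[str]:
    """Return the artifact type implied by the filename prefix, or None."""
    for prefix, kind in TYPE_PREFIXES.items():
        if path.name.startswith(prefix):
            return kind
    return None  # assumption: unrecognized files are simply not classified


print(detect_type(Path("outputs/skill-rag.md")))   # skill
print(detect_type(Path("outputs/README.md")))      # None
```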

If the frontmatter lacks explicit `phase` or `lesson` fields, the script calls `derive_phase_lesson()` to extract default values from the directory path components.
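
As a rough illustration of path-derived defaults, the sketch below reads the phase and lesson from the two directory levels above `outputs/`. The real `derive_phase_lesson()` is not reproduced in this guide, so its return format is unknown; `derive_phase_lesson_sketch` and `fill_defaults` are hypothetical stand-ins that return the raw directory names.

```python
from pathlib import Path


def derive_phase_lesson_sketch(path: Path) -> tuple[str, str]:
    """Return (phase_dir, lesson_dir) for .../<phase>/<lesson>/outputs/<file>."""
    lesson_dir = path.parent.parent  # step over the outputs/ directory
    phase_dir = lesson_dir.parent
    return phase_dir.name, lesson_dir.name


def fill_defaults(meta: dict, path: Path) -> dict:
    """Keep explicit frontmatter values; fall back to the path otherwise."""
    phase, lesson = derive_phase_lesson_sketch(path)
    filled = dict(meta)
    filled.setdefault("phase", phase)
    filled.setdefault("lesson", lesson)
    return filled


example = fill_defaults(
    {"name": "intro", "phase": "1"},  # phase is explicit, lesson is missing
    Path("phases/01-foundations/02-intro/outputs/skill-intro.md"),
)
print(example)  # {'name': 'intro', 'phase': '1', 'lesson': '02-intro'}
```

The explicit `phase` value survives, while the missing `lesson` is taken from the directory name.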

### Artifact Initialization

Each discovered file becomes an `Artifact` dataclass instance containing the fields: `type`, `name`, `phase`, `lesson`, `version`, `description`, `tags`, and `source`. The file content is read and passed to `parse_frontmatter()` for metadata extraction.
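
A dataclass with these eight fields might look like the following sketch. The field names come from the list above, but the types, the default values, and the `build_artifact` helper are assumptions made for illustration; the class is named `ArtifactSketch` to keep it distinct from the script's own `Artifact`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ArtifactSketch:
    type: str
    name: str
    phase: str
    lesson: str
    version: str = ""                       # assumed default
    description: str = ""                   # assumed default
    tags: list[str] = field(default_factory=list)
    source: Optional[Path] = None           # file the metadata came from


def build_artifact(kind: str, path: Path, meta: dict,
                   phase: str, lesson: str) -> ArtifactSketch:
    """Populate the dataclass from a parsed frontmatter dictionary."""
    tags = meta.get("tags", [])
    return ArtifactSketch(
        type=kind,
        name=str(meta.get("name", path.stem)),  # assumption: fall back to filename
        phase=str(meta.get("phase", phase)),
        lesson=str(meta.get("lesson", lesson)),
        version=str(meta.get("version", "")),
        description=str(meta.get("description", "")),
        tags=list(tags) if isinstance(tags, list) else [],
        source=path,
    )


art = build_artifact(
    "skill",
    Path("phases/01-foundations/02-intro/outputs/skill-intro.md"),
    {"name": "Intro Skill", "tags": ["foundation"]},
    phase="01-foundations",
    lesson="02-intro",
)
print(art.name, art.tags, art.lesson)
```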

## Frontmatter Parsing Implementation

The YAML parsing logic lives in [`scripts/_lib.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/_lib.py) (lines 12-60) within the `parse_frontmatter()` function. This implementation operates without third-party libraries like PyYAML, making the installation script dependency-free.

### The parse_frontmatter() Function

The parser expects a YAML frontmatter block delimited by `---` on its own line at the start of the file. It extracts the content between the opening and closing delimiters and processes it line-by-line.

```python
def parse_frontmatter(text: str) -> dict[str, object] | None:
    """Parse a YAML-subset front-matter block at the top of a markdown string."""
    if not text.startswith("---\n"):
        return None
    end = text.find("\n---\n", 4)
    if end == -1 and text.endswith("\n---"):
        end = len(text) - 4
    if end == -1:
        return None
    block = text[4:end].strip("\n")
    result: dict[str, object] = {}
    for raw in block.splitlines():
        if not raw or raw.startswith("#") or raw[0] in (" ", "\t"):
            continue
        if ":" not in raw:
            continue
        key, _, value = raw.partition(":")
        key = key.strip()
        value = value.strip()
        # List and string parsing logic...

    return result
```

### Supported YAML Constructs

The parser handles a strict subset of YAML syntax required for curriculum metadata:
- **Bare strings**: `description: Some text here`
- **Quoted strings**: `name: "Advanced RAG"` or `version: '1.0'`
- **Inline lists**: `tags: [python, rag, "advanced topics"]`
- **Comments**: Lines starting with `#` are ignored
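
The value handling for these constructs is elided in the excerpt above, so the following is only a minimal sketch of how a raw value string could be turned into a string or a list. `parse_value` and `unquote` are hypothetical helpers, not the repository's code, and the sketch splits list items on every comma, so a quoted item that itself contains a comma would be split incorrectly.

```python
def unquote(text: str) -> str:
    """Strip one pair of matching single or double quotes, if present."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1]
    return text


def parse_value(value: str) -> object:
    """Turn the text after 'key:' into a string or a list of strings."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        inner = value[1:-1].strip()
        if not inner:
            return []
        return [unquote(item.strip()) for item in inner.split(",")]
    return unquote(value)


print(parse_value("Some text here"))                      # Some text here
print(parse_value("'1.0'"))                               # 1.0
print(parse_value('[python, rag, "advanced topics"]'))    # ['python', 'rag', 'advanced topics']
```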

### Return Behavior

The function returns a dictionary mapping keys to parsed values, or `None` if the file lacks valid frontmatter delimiters. This dictionary is then used to populate the `Artifact` dataclass fields during the discovery phase.
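
The `None` cases are easier to see with the delimiter logic isolated. The helper below, `extract_block`, is a hypothetical name; its body copies the delimiter checks from the `parse_frontmatter()` excerpt and returns the raw block text instead of a dictionary.

```python
from typing import Optional


def extract_block(text: str) -> Optional[str]:
    """Return the text between the frontmatter delimiters, or None."""
    if not text.startswith("---\n"):
        return None                      # no opening delimiter on line 1
    end = text.find("\n---\n", 4)
    if end == -1 and text.endswith("\n---"):
        end = len(text) - 4              # closing delimiter is the last line
    if end == -1:
        return None                      # opening delimiter never closed
    return text[4:end].strip("\n")


print(extract_block("---\nname: x\n---\nBody text"))  # name: x
print(extract_block("---\nname: x\n---"))             # name: x
print(extract_block("# Title only\n"))                # None
print(extract_block("---\nname: x\n"))                # None
```

Callers therefore need to check for `None` before reading keys from the result.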

## Integration Pipeline

The two mechanisms operate sequentially to build the installation manifest:

1. **Discovery**: `discover_artifacts()` locates all `.md` files in directories matching `phases/*/[0-9][0-9]-*/outputs`
2. **Parsing**: Each file's text is passed to `parse_frontmatter()` to extract metadata fields like `name`, `description`, `version`, and `tags`
3. **Enrichment**: Missing `phase` or `lesson` values are derived from the file path using `derive_phase_lesson()`
4. **Installation**: The populated `Artifact` objects are filtered by type and copied to the target directory, with entries recorded in [`manifest.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/manifest.json)
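
The final step of the list above (filter by type, copy, record) can be sketched as follows. Everything here is an assumption made for illustration: the function name `install_sketch`, the use of plain dictionaries in place of `Artifact` objects, and the manifest layout, which is not documented in this guide.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def install_sketch(artifacts: list[dict], target: Path,
                   wanted_type: Optional[str] = None) -> list[dict]:
    """Copy matching artifacts into target and write a manifest (assumed format)."""
    target.mkdir(parents=True, exist_ok=True)
    entries = []
    for art in artifacts:
        if wanted_type is not None and art["type"] != wanted_type:
            continue  # filtered out, like passing --type on the command line
        dest = target / art["source"].name
        shutil.copy2(art["source"], dest)
        entries.append({"type": art["type"], "name": art["name"], "file": dest.name})
    manifest_path = target / "manifest.json"
    manifest_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


# Hypothetical inputs created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skill = root / "skill-intro.md"
    prompt = root / "prompt-review.md"
    skill.write_text("---\nname: intro\n---\n", encoding="utf-8")
    prompt.write_text("---\nname: review\n---\n", encoding="utf-8")
    found = [
        {"type": "skill", "name": "intro", "source": skill},
        {"type": "prompt", "name": "review", "source": prompt},
    ]
    installed = install_sketch(found, root / "out", wanted_type="skill")
    manifest = json.loads((root / "out" / "manifest.json").read_text(encoding="utf-8"))

print(installed)
```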

## Working with the Installation Tool

### Listing Skill Artifacts (Dry Run)

Execute the following command to preview which skills would be installed without copying files:

```bash
python3 scripts/install_skills.py /tmp/output --type skill --dry-run
```

This invokes the discovery and parsing pipeline, printing a summary of matched artifacts and their metadata.

### Direct Frontmatter Access

You can import the parser independently to inspect curriculum metadata:

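```python
from pathlib import Path

# Run from the repository root so that the scripts directory is importable.
from scripts._lib import parse_frontmatter

# Illustrative path: substitute any Markdown file from a lesson's outputs/ directory.
md_path = Path("phases/01-foundations/01-intro/outputs/skill-intro.md")
meta = parse_frontmatter(md_path.read_text(encoding="utf-8"))

if meta is None:
    # parse_frontmatter() returns None when the delimiters are missing.
    print("No frontmatter block found")
else:
    print(meta.get("name"))        # e.g. "Intro Skill"
    print(meta.get("tags", []))    # e.g. ["foundation", "basics"]
```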

## Summary

- **`discover_artifacts()`** in [`scripts/install_skills.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/install_skills.py) walks the `phases/` directory tree, filtering for Markdown files with specific filename prefixes (`skill-`, `prompt-`, `agent-`)
- **`parse_frontmatter()`** in [`scripts/_lib.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/_lib.py) provides a lightweight, zero-dependency YAML parser supporting strings, lists, and comments
- The **Artifact dataclass** standardizes metadata fields including type, version, tags, and lesson associations
- Path-derived defaults from `derive_phase_lesson()` supply `phase` and `lesson` values when the frontmatter omits them
- The entire pipeline operates without external Python packages, relying solely on the standard library

## Frequently Asked Questions

### What file pattern does discover_artifacts() search for?

The function searches for directories matching `phases/*/[0-9][0-9]-*/outputs` and considers only files ending in `.md`. This structure corresponds to the curriculum's phase-lesson organization, where two-digit prefixes indicate ordering.

### Which YAML constructs does parse_frontmatter() support?

The parser supports bare strings, single- and double-quoted strings, comma-separated inline lists enclosed in brackets, and comment lines beginning with `#`. It does not support nested mappings, multi-line strings, or YAML anchors and aliases.
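
The line filter quoted earlier in this guide shows why nested structures are not picked up: any line that begins with a space or tab is skipped. The sketch below reuses just that filter in a hypothetical helper, `top_level_pairs`, to show what survives from a block that contains nested keys and a block-style list.

```python
def top_level_pairs(block: str) -> list[tuple[str, str]]:
    """Return (key, raw value) pairs for unindented 'key: value' lines only."""
    pairs = []
    for raw in block.splitlines():
        # Same skip rules as the excerpt: blanks, comments, indented lines.
        if not raw or raw.startswith("#") or raw[0] in (" ", "\t"):
            continue
        if ":" not in raw:
            continue
        key, _, value = raw.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs


block = "name: demo\n# a comment\nmeta:\n  nested: ignored\ntags:\n  - a\n  - b\n"
print(top_level_pairs(block))  # [('name', 'demo'), ('meta', ''), ('tags', '')]
```

The indented `nested:` key and the `- a` / `- b` items never reach the value parser, so list metadata should be written in the inline bracket form.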

### How does the script handle missing metadata fields?

If the frontmatter dictionary lacks `phase` or `lesson` keys, [`scripts/install_skills.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/scripts/install_skills.py) invokes `derive_phase_lesson()` to infer these values from the file's directory path, so every `Artifact` instance carries a phase and lesson even when the frontmatter omits them.

### What types of artifacts can be discovered?

The discovery mechanism recognizes three artifact types based on filename prefixes: **skills** (files starting with `skill-`), **prompts** (files starting with `prompt-`), and **agents** (files starting with `agent-`). Each type is handled identically during parsing but may be filtered separately during installation.