Artifact Discovery and Frontmatter Parsing in install_skills.py: A Complete Guide
The install_skills.py script automates curriculum installation by scanning phases/*/[0-9][0-9]-*/outputs/ for Markdown files, extracting their YAML frontmatter via parse_frontmatter(), and converting each into a typed Artifact dataclass, all without external dependencies.
The rohitg00/ai-engineering-from-scratch repository provides a lightweight toolchain for packaging AI engineering educational content. At its core, the artifact discovery and frontmatter parsing pipeline transforms Markdown-based skills, prompts, and agents into installable components. This article examines how scripts/install_skills.py locates curriculum files and parses their metadata using only Python standard library utilities.
Artifact Discovery Mechanism
The discovery phase is implemented by discover_artifacts() in scripts/install_skills.py (lines 91-137). This function walks the repository's phases/ directory structure to identify candidate Markdown files for installation.
Directory Traversal and File Filtering
The function targets the specific glob pattern */[0-9][0-9]-*/outputs under the PHASES_DIR constant (the repository root's phases folder). It filters for .md files exclusively, ignoring all other extensions.
def discover_artifacts() -> Iterable[Artifact]:
    if not PHASES_DIR.is_dir():
        return
    for output_dir in sorted(PHASES_DIR.glob("*/[0-9][0-9]-*/outputs")):
        for path in sorted(output_dir.iterdir()):
            if path.suffix != ".md" or not path.is_file():
                continue
            # Type detection and frontmatter parsing follow...
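To see which directories the glob pattern actually selects, here is a minimal, self-contained sketch that runs the same pattern against a throwaway directory tree. The helper names (find_output_dirs, build_demo_tree) and the folder names are made up for illustration and are not part of the repository.

```python
import tempfile
from pathlib import Path


def find_output_dirs(phases_dir: Path) -> list[Path]:
    """Return lesson output directories matching the pattern quoted above."""
    if not phases_dir.is_dir():
        return []
    return sorted(phases_dir.glob("*/[0-9][0-9]-*/outputs"))


def build_demo_tree(root: Path) -> Path:
    """Create a tiny, made-up phases/ tree for illustration only."""
    phases = root / "phases"
    for rel in (
        "01-foundations/01-intro/outputs",       # matches: two-digit lesson prefix
        "01-foundations/02-tensors/outputs",     # matches
        "01-foundations/notes/outputs",          # skipped: no two-digit prefix
        "01-foundations/03-deep/extra/outputs",  # skipped: one level too deep
    ):
        (phases / rel).mkdir(parents=True)
    return phases


with tempfile.TemporaryDirectory() as tmp:
    phases = build_demo_tree(Path(tmp))
    matched = [p.relative_to(phases).as_posix() for p in find_output_dirs(phases)]
    print(matched)
```

Only the two lesson folders with a two-digit prefix are reported. Because each * matches a single path component, an outputs folder nested more deeply than phase/lesson is not picked up.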
Filename-Based Type Detection
After locating a Markdown file, the script determines the artifact category by inspecting the filename prefix. Valid prefixes include:
- skill- → type skill
- prompt- → type prompt
- agent- → type agent
If the frontmatter lacks explicit phase or lesson fields, the script calls derive_phase_lesson() to extract default values from the directory path components.
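The prefix check and the path-based fallback can be sketched as follows. This is not the repository's code: detect_type and derive_phase_lesson_sketch are hypothetical names, and the assumption that phase and lesson are simply the two directory names above outputs/ may not match how derive_phase_lesson() normalises them.

```python
from pathlib import Path
from typing import Optional

# Prefix-to-type mapping taken from the list above.
PREFIX_TO_TYPE = {"skill-": "skill", "prompt-": "prompt", "agent-": "agent"}


def detect_type(path: Path) -> Optional[str]:
    """Return the artifact type implied by the filename prefix, else None."""
    for prefix, kind in PREFIX_TO_TYPE.items():
        if path.name.startswith(prefix):
            return kind
    return None


def derive_phase_lesson_sketch(path: Path) -> tuple[str, str]:
    """Hypothetical stand-in: read phase and lesson from the path components.

    Assumes the layout <phase>/<lesson>/outputs/<file>.md.
    """
    lesson_dir = path.parent.parent  # .../<phase>/<lesson>
    return lesson_dir.parent.name, lesson_dir.name


example = Path("phases/01-foundations/02-tensors/outputs/skill-tensors.md")
print(detect_type(example))                 # skill
print(derive_phase_lesson_sketch(example))  # ('01-foundations', '02-tensors')
```

A file whose name carries none of the three prefixes yields None here; what the real script does with such files is not covered by this article.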
Artifact Initialization
Each discovered file becomes an Artifact dataclass instance containing the fields: type, name, phase, lesson, version, description, tags, and source. The file content is read and passed to parse_frontmatter() for metadata extraction.
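As a rough picture of that record, the sketch below declares a dataclass with the eight listed fields and fills it from a frontmatter dictionary. The class name ArtifactSketch, the field types, the helper artifact_from_meta, and the fallback of using the file stem when name is missing are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ArtifactSketch:
    """Illustrative shape only; the real Artifact's field types may differ."""
    type: str
    name: str
    phase: str
    lesson: str
    version: str
    description: str
    tags: list[str]
    source: Path


def artifact_from_meta(kind: str, source: Path, meta: dict,
                       default_phase: str, default_lesson: str) -> ArtifactSketch:
    """Build a record, preferring frontmatter values over path-derived defaults."""
    tags = meta.get("tags", [])
    if isinstance(tags, str):  # tolerate 'tags: single' written without brackets
        tags = [tags]
    return ArtifactSketch(
        type=kind,
        name=str(meta.get("name", source.stem)),  # assumed fallback
        phase=str(meta.get("phase", default_phase)),
        lesson=str(meta.get("lesson", default_lesson)),
        version=str(meta.get("version", "")),
        description=str(meta.get("description", "")),
        tags=list(tags),
        source=source,
    )
```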
Frontmatter Parsing Implementation
The YAML parsing logic lives in scripts/_lib.py (lines 12-60) within the parse_frontmatter() function. This implementation operates without third-party libraries like PyYAML, making the installation script dependency-free.
The parse_frontmatter() Function
The parser expects a YAML frontmatter block delimited by --- on its own line at the start of the file. It extracts the content between the opening and closing delimiters and processes it line-by-line.
def parse_frontmatter(text: str) -> dict[str, object] | None:
    """Parse a YAML-subset front-matter block at the top of a markdown string."""
    if not text.startswith("---\n"):
        return None
    end = text.find("\n---\n", 4)
    if end == -1 and text.endswith("\n---"):
        end = len(text) - 4
    if end == -1:
        return None
    block = text[4:end].strip("\n")
    result: dict[str, object] = {}
    for raw in block.splitlines():
        if not raw or raw.startswith("#") or raw[0] in (" ", "\t"):
            continue
        if ":" not in raw:
            continue
        key, _, value = raw.partition(":")
        key = key.strip()
        value = value.strip()
        # List and string parsing logic...
    return result
Supported YAML Constructs
The parser handles a strict subset of YAML syntax required for curriculum metadata:
- Bare strings: description: Some text here
- Quoted strings: name: "Advanced RAG" or version: '1.0'
- Inline lists: tags: [python, rag, "advanced topics"]
- Comments: Lines starting with # are ignored
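The excerpt above elides the value-handling step, so the following is one plausible way to turn the text after the colon into a string or a list. The helpers parse_scalar and parse_value are hypothetical and the repository's own logic may differ, for example in how it treats commas inside quoted list items.

```python
def parse_scalar(token: str) -> str:
    """Strip surrounding whitespace and one pair of matching quotes, if present."""
    token = token.strip()
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "\"'":
        return token[1:-1]
    return token


def parse_value(value: str):
    """Turn the text after 'key:' into a string or a list of strings."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        inner = value[1:-1].strip()
        if not inner:
            return []
        # Naive split: a comma inside a quoted item would break this sketch.
        return [parse_scalar(item) for item in inner.split(",")]
    return parse_scalar(value)


print(parse_value('[python, rag, "advanced topics"]'))  # ['python', 'rag', 'advanced topics']
print(parse_value("'1.0'"))                             # 1.0
```

Note that every value stays a string: version: '1.0' and version: 1.0 both come back as the text 1.0 rather than a float, which keeps the sketch in line with a string-and-list-only subset.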
Return Behavior
The function returns a dictionary mapping keys to parsed values, or None if the file lacks valid frontmatter delimiters. This dictionary is then used to populate the Artifact dataclass fields during the discovery phase.
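To make the None cases concrete, the sketch below isolates the delimiter handling from the excerpt above into a standalone helper. The name extract_block is hypothetical; the logic mirrors the quoted lines and nothing more.

```python
from typing import Optional


def extract_block(text: str) -> Optional[str]:
    """Return the raw text between the --- delimiters, or None if absent."""
    if not text.startswith("---\n"):
        return None                      # no opening delimiter on line one
    end = text.find("\n---\n", 4)
    if end == -1 and text.endswith("\n---"):
        end = len(text) - 4              # closing delimiter is the last line
    if end == -1:
        return None                      # opening delimiter never closed
    return text[4:end].strip("\n")


print(extract_block("---\nname: x\n---\nbody"))   # name: x
print(extract_block("---\nname: x\n---"))         # name: x
print(extract_block("# Title\nno frontmatter"))   # None
print(extract_block("---\nname: x\n"))            # None
```

Callers therefore need to check for None before reading keys, since a plain Markdown file with no frontmatter is a normal input rather than an error.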
Integration Pipeline
The two mechanisms operate sequentially to build the installation manifest:
- Discovery: discover_artifacts() locates all .md files under phases/*/[0-9][0-9]-*/outputs/
- Parsing: Each file's text is passed to parse_frontmatter() to extract metadata fields like name, description, version, and tags
- Enrichment: Missing phase or lesson values are derived from the file path using derive_phase_lesson()
- Instantiation: The populated Artifact objects are filtered by type and copied to the target directory, with entries recorded in manifest.json
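The final filter-and-copy step might look roughly like the sketch below. The function install_sketch and the manifest layout (a JSON list of type and file records) are invented for illustration; the article does not describe the actual structure of manifest.json.

```python
import json
import shutil
import tempfile
from pathlib import Path


def install_sketch(sources: dict[Path, str], target: Path, wanted_type: str) -> list[dict]:
    """Copy artifacts of one type into target and write a manifest.

    `sources` maps each file path to its detected type. The manifest layout
    is an assumption made for this sketch.
    """
    target.mkdir(parents=True, exist_ok=True)
    entries = []
    for path, kind in sorted(sources.items()):
        if kind != wanted_type:
            continue  # filtered out by type
        shutil.copy2(path, target / path.name)
        entries.append({"type": kind, "file": path.name})
    manifest = target / "manifest.json"
    manifest.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "outputs"
    src.mkdir()
    files = {src / "skill-intro.md": "skill", src / "prompt-review.md": "prompt"}
    for f in files:
        f.write_text("---\nname: demo\n---\n", encoding="utf-8")
    installed = install_sketch(files, Path(tmp) / "target", "skill")
    print(installed)  # [{'type': 'skill', 'file': 'skill-intro.md'}]
```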
Working with the Installation Tool
Listing Skill Artifacts (Dry Run)
Execute the following command to preview which skills would be installed without copying files:
python3 scripts/install_skills.py /tmp/output --type skill --dry-run
This invokes the discovery and parsing pipeline, printing a summary of matched artifacts and their metadata.
Direct Frontmatter Access
You can import the parser independently to inspect curriculum metadata:
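from pathlib import Path

from scripts._lib import parse_frontmatter  # run from the repository root

# Example path; adjust to an artifact that exists in your checkout.
md_text = Path("phases/01-foundations/01-intro/outputs/skill-intro.md").read_text(encoding="utf-8")
meta = parse_frontmatter(md_text) or {}  # None means no valid frontmatter
print(meta.get("name"))      # e.g. "Intro Skill"
print(meta.get("tags", []))  # e.g. ["foundation", "basics"]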
Summary
- discover_artifacts() in scripts/install_skills.py walks the phases/ directory tree, filtering for Markdown files with specific filename prefixes (skill-, prompt-, agent-)
- parse_frontmatter() in scripts/_lib.py provides a lightweight, zero-dependency YAML parser supporting strings, lists, and comments
- The Artifact dataclass standardizes metadata fields including type, version, tags, and lesson associations
- Path-derived defaults ensure robust operation even when frontmatter lacks explicit phase or lesson declarations
- The entire pipeline operates without external Python packages, relying solely on the standard library
Frequently Asked Questions
What file pattern does discover_artifacts() search for?
The function searches for directories matching phases/*/[0-9][0-9]-*/outputs and considers only files ending in .md. This structure corresponds to the curriculum's phase-lesson organization, where two-digit prefixes indicate ordering.
Which YAML constructs does parse_frontmatter() support?
The parser supports bare strings, single and double-quoted strings, comma-separated inline lists enclosed in brackets, and comment lines beginning with #. It does not support nested dictionaries, multi-line strings, or complex YAML anchors.
How does the script handle missing metadata fields?
If the frontmatter dictionary lacks phase or lesson keys, install_skills.py invokes derive_phase_lesson() to infer these values from the file's directory path. This ensures every Artifact instance has complete metadata regardless of frontmatter completeness.
What types of artifacts can be discovered?
The discovery mechanism recognizes three artifact types based on filename prefixes: skills (files starting with skill-), prompts (files starting with prompt-), and agents (files starting with agent-). Each type is handled identically during parsing but may be filtered separately during installation.