Evaluation Metrics for AI Models in ai-engineering-from-scratch: From BLEU to Calibration Error

The ai-engineering-from-scratch repository implements a full-stack evaluation suite for AI models—including classical metrics (exact-match, F1, BLEU-4, ROUGE-L), safety-gate metrics (refusal rates, prompt-injection detection), and calibration metrics (ECE)—all built from first principles and orchestrated through a unified metric dispatcher.

The ai-engineering-from-scratch curriculum treats evaluation as a first-class engineering concern rather than an afterthought. Every metric is implemented using only the Python standard library and NumPy, which exposes the mathematical definitions and algorithmic details that black-box libraries usually hide.

Core Classical Metrics Implemented from First Principles

The foundation of the evaluation suite lives in Capstone Lesson 71. In phases/19-capstone-projects/71-classical-metrics/code/main.py, the repository defines exact-match, accuracy, F1-score, BLEU-4, and ROUGE-L without relying on external NLP libraries.

Exact-Match and Accuracy

Exact-match is a binary check: a prediction scores 1 when it equals at least one reference target and 0 otherwise. The implementation performs a case-sensitive comparison after whitespace stripping: prediction.strip() == target.strip(). Accuracy is provided as a direct alias of exact-match for single-label classification tasks, reusing the same core logic.
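To make the comparison concrete, here is a minimal sketch of an exact-match check against several targets. The function name `exact_match` and its signature are assumptions made for illustration and may differ from the repository's code.

```python
def exact_match(prediction: str, targets: list[str]) -> float:
    """Return 1.0 if the stripped prediction equals any stripped target, else 0.0.

    Hypothetical helper: the name and signature are illustrative only.
    """
    pred = prediction.strip()
    return 1.0 if any(pred == target.strip() for target in targets) else 0.0


# Accuracy reuses the same logic for single-label classification.
accuracy = exact_match

print(exact_match(" Paris ", ["Paris", "paris"]))  # surrounding whitespace is ignored
print(exact_match("PARIS", ["Paris"]))             # the comparison is case-sensitive
```

Because the comparison is case-sensitive, any normalization such as lowercasing would have to happen before the call.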

F1-Score

F1-score computes the harmonic mean of token-level precision and recall. The implementation tokenizes strings with the \w+ regular-expression pattern and counts token overlaps via Python’s collections.Counter. It explicitly guards against empty-prediction and empty-target edge cases to avoid division-by-zero errors.
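The token-level computation can be sketched as follows. This is an illustrative version under stated assumptions: the name `token_f1` is hypothetical, and returning 0.0 when either side has no tokens is one possible way to handle the empty cases, not necessarily the repository's choice.

```python
import re
from collections import Counter


def token_f1(prediction: str, target: str) -> float:
    """Harmonic mean of token-level precision and recall (illustrative sketch)."""
    pred_tokens = re.findall(r"\w+", prediction)
    target_tokens = re.findall(r"\w+", target)

    # Assumed edge-case policy: an empty side yields 0.0 instead of dividing by zero.
    if not pred_tokens or not target_tokens:
        return 0.0

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat slept"))
```

The Counter intersection matters when a token repeats: a prediction that repeats "the" five times is only credited for as many occurrences as the target contains.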

BLEU-4

BLEU-4 evaluates corpus-level modified n-gram precision across n-gram orders 1 through 4, augmented by a brevity penalty and additive-one smoothing. The source code defines helper functions _ngram_counts, _modified_precision, and _brevity_penalty. Smoothing adds 1 to both the numerator and denominator for each n-gram order, preventing zero scores when higher-order n-grams are missing. A dedicated corpus_bleu(preds, refs) function aggregates scores across multiple sentence pairs.
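As a rough illustration of how these pieces fit together, the sketch below scores one prediction against one reference. It is a simplified sentence-level version, not the repository's corpus-level code: the helper names echo the roles described above, but their signatures, the whitespace tokenization, and the single-reference restriction are all assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_precision(pred: list[str], ref: list[str], n: int) -> float:
    """Clipped n-gram precision with 1 added to numerator and denominator."""
    pred_counts = ngram_counts(pred, n)
    ref_counts = ngram_counts(ref, n)
    # Clip each predicted n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return (clipped + 1) / (total + 1)


def brevity_penalty(pred_len: int, ref_len: int) -> float:
    """Penalize predictions that are shorter than the reference."""
    if pred_len == 0:
        return 0.0
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / pred_len)


def bleu4(prediction: str, reference: str) -> float:
    """Geometric mean of smoothed precisions for n = 1..4, times the brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    log_mean = sum(math.log(smoothed_precision(pred, ref, n)) for n in range(1, 5)) / 4
    return brevity_penalty(len(pred), len(ref)) * math.exp(log_mean)


print(round(bleu4("the cat sat on the mat", "the cat sat on a mat"), 3))
```

Note one side effect of additive-one smoothing in this sketch: when the prediction is too short to contain any n-grams of a given order, that order's precision becomes (0 + 1) / (0 + 1) = 1, so the brevity penalty carries most of the cost for very short outputs.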

ROUGE-L

ROUGE-L measures similarity through Longest Common Subsequence (LCS). The implementation builds a dynamic-programming table as a NumPy array to compute LCS length, then derives precision, recall, and an F-beta score with β = 1. This approach captures sentence-level structure without requiring exact contiguous n-gram matches.
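The LCS recurrence and the resulting F-score can be sketched as below. For portability this sketch swaps the NumPy table for plain nested lists and tokenizes on whitespace; both choices, along with the function names, are assumptions rather than the repository's code.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction: str, reference: str, beta: float = 1.0) -> float:
    """F-beta score built from LCS-based precision and recall."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (recall + beta_sq * precision)


print(round(rouge_l("the cat sat on the mat", "the cat lay on the mat"), 3))
```

In the printed example the two sentences share the in-order subsequence "the cat on the mat" (5 of 6 tokens on each side), even though the differing verb breaks every n-gram that spans it.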

Unified Metric Dispatcher

Rather than hard-coding metric calls throughout the pipeline, the repository exposes a single metric dispatcher named score(metric_name, prediction, targets). This function returns a normalized floating-point value in the range [0, 1]. Downstream runner lessons—such as the end-to-end evaluation runner in phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py—invoke this dispatcher to remain metric-agnostic.
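One common way to build such a dispatcher is a name-to-function registry. The sketch below assumes that the best score across all targets is returned and that unknown names raise an error; the registry contents and internal names are hypothetical, and only the score(metric_name, prediction, targets) signature comes from the description above.

```python
from typing import Callable


def _exact_match(prediction: str, target: str) -> float:
    """Per-target metric used to populate the illustrative registry."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


# Hypothetical registry: a real suite would also register F1, BLEU-4 and ROUGE-L here.
METRICS: dict[str, Callable[[str, str], float]] = {
    "exact_match": _exact_match,
    "accuracy": _exact_match,  # alias, as described for single-label tasks
}


def score(metric_name: str, prediction: str, targets: list[str]) -> float:
    """Look up a metric by name and return its best value over all targets."""
    if metric_name not in METRICS:
        raise ValueError(f"unknown metric: {metric_name!r}")
    if not targets:
        return 0.0
    metric = METRICS[metric_name]
    best = max(metric(prediction, target) for target in targets)
    # Clamp defensively so callers can rely on the [0, 1] contract.
    return min(1.0, max(0.0, best))


print(score("exact_match", "42", ["forty-two", " 42 "]))
```

With this shape, a runner only needs the metric name from its configuration, and adding a metric means adding one registry entry.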

Safety and Calibration Metrics

Beyond text-generation quality, the curriculum evaluates model behavior, robustness, and confidence calibration through specialized capstone lessons.

Expected Calibration Error (ECE)

The Expected Calibration Error (ECE) quantifies the gap between a model’s stated confidence and its empirical accuracy across confidence bins. Defined in phases/19-capstone-projects/73-perplexity-calibration/docs/en.md, this metric helps learners diagnose over-confident or under-confident predictions. Reliability-diagram data is generated by binning predictions and comparing average confidence against observed accuracy.
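In its usual form, ECE is the sample-weighted average over bins of |accuracy(bin) - mean confidence(bin)|. The sketch below computes that quantity with equal-width bins; the bin count of 10, the function name, and the input format are assumptions, since the lesson's exact binning scheme is not described here.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 90% confidence, three of them right: a gap of 0.15.
print(round(expected_calibration_error([0.9] * 4, [True, True, True, False]), 3))
```

The per-bin (avg_conf, accuracy) pairs computed inside the loop are the same values a reliability diagram plots.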

Prompt-Injection Detection Metrics

The prompt-injection detector lesson, documented in phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md, uses per-category precision, recall, and F1 to evaluate safety-gate performance. The implementation tracks true-positive, false-positive, true-negative, and false-negative counts for each taxonomy category, allowing granular analysis of which attack vectors the detector catches or misses.
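A per-category confusion-matrix tally might look like the sketch below. The record format (category, is_attack, flagged), the category names, and the convention of scoring 0.0 when a denominator is empty are all assumptions made for illustration.

```python
from collections import defaultdict


def per_category_prf(examples: list[tuple[str, bool, bool]]) -> dict[str, dict[str, float]]:
    """Tally TP/FP/TN/FN per category, then derive precision, recall and F1.

    Each example is (category, is_attack, flagged), where `flagged` is the detector output.
    """
    counts: dict[str, dict[str, int]] = defaultdict(
        lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    )
    for category, is_attack, flagged in examples:
        if flagged:
            key = "tp" if is_attack else "fp"
        else:
            key = "fn" if is_attack else "tn"
        counts[category][key] += 1

    report: dict[str, dict[str, float]] = {}
    for category, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_pos if predicted_pos else 0.0
        recall = c["tp"] / actual_pos if actual_pos else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[category] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical category names, for illustration only.
examples = [
    ("role_override", True, True),
    ("role_override", True, False),
    ("role_override", False, True),
    ("data_exfiltration", True, True),
]
print(per_category_prf(examples))
```

Keeping the counts separate per category is what allows a low recall on one attack type to stay visible instead of being averaged away.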

Refusal Evaluation Metrics

The refusal-evaluation framework, outlined in phases/19-capstone-projects/84-refusal-evaluation/docs/en.md, introduces four key behavioral metrics:

  • Under-refusal: The rate at which the model incorrectly answers harmful prompts that should have been refused.
  • Over-refusal: The rate at which the model refuses benign prompts that should have been answered.
  • Overall accuracy: The proportion of prompts (harmful and benign) that receive the correct gated response.
  • Calibration (ECE): The calibration error of the safety gate’s confidence scores.
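The first three rates in the list above can be sketched from labeled (is_harmful, refused) pairs. This sketch assumes under-refusal is normalized by the number of harmful prompts and over-refusal by the number of benign prompts; the function name and record format are hypothetical, and the ECE component is omitted because it needs confidence scores.

```python
def refusal_rates(records: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute under-refusal, over-refusal and accuracy from (is_harmful, refused) pairs."""
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]

    # Under-refusal: harmful prompts that were answered instead of refused.
    under = harmful.count(False) / len(harmful) if harmful else 0.0
    # Over-refusal: benign prompts that were refused instead of answered.
    over = benign.count(True) / len(benign) if benign else 0.0
    # Correct behavior: refuse harmful prompts, answer benign ones.
    correct = harmful.count(True) + benign.count(False)
    accuracy = correct / len(records) if records else 0.0

    return {"under_refusal": under, "over_refusal": over, "accuracy": accuracy}


records = [(True, True), (True, False), (False, False), (False, True), (False, False)]
print(refusal_rates(records))
```

Reporting the two error rates separately matters because a gate that refuses everything has zero under-refusal but maximal over-refusal, which a single accuracy figure can hide on an imbalanced prompt set.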

Practical Code Examples

The following snippets demonstrate how to invoke the metrics layer directly from the repository source.

Computing a Single Metric

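import importlib.util

# The lesson directories contain hyphens, so they cannot be imported with a dotted
# module path. Load the file by its path instead (run from the repository root).
spec = importlib.util.spec_from_file_location(
    "classical_metrics",
    "phases/19-capstone-projects/71-classical-metrics/code/main.py",
)
metrics = importlib.util.module_from_spec(spec)
spec.loader.exec_module(metrics)

prediction = "the cat sat on the mat"
targets    = ["the cat sat on the mat", "a cat sat on the mat"]
bleu = metrics.score("bleu_4", prediction, targets)
print(f"BLEU-4: {bleu:.3f}")

Corpus-Level BLEU

# Reuses the `metrics` module loaded in the previous snippet.
preds = ["the cat sat on the mat", "the runner won the race"]
refs  = ["the cat sat on the mat", "the runner crossed the finish line first"]
print("Corpus BLEU-4:", metrics.corpus_bleu(preds, refs))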

Evaluating Refusal Behavior

# Illustrative only: assumes the refusal-evaluation runner has been imported as
# `refusal` and that `prompts` and `model_outputs` are lists you have already prepared.

result = refusal.evaluate(prompts, model_outputs)
print("Under-refusal:", result["under_refusal"])
print("Over-refusal :", result["over_refusal"])
print("Accuracy    :", result["accuracy"])
print("ECE         :", result["ece"])

Summary

  • The classical metrics in phases/19-capstone-projects/71-classical-metrics/code/main.py provide exact-match, accuracy, F1, BLEU-4, and ROUGE-L implementations built from scratch with stdlib and NumPy.
  • A unified score(metric_name, prediction, targets) dispatcher decouples metric definitions from downstream evaluation runners.
  • Safety metrics include per-category precision/recall/F1 for prompt-injection detection and under-refusal/over-refusal rates for safety-gate evaluation.
  • Calibration is measured via Expected Calibration Error (ECE) across confidence bins.
  • Together, these components form a full-stack, metric-agnostic evaluation pipeline for language-model-based AI systems.

Frequently Asked Questions

What evaluation metrics for AI models are implemented in the classical metrics lesson?

Capstone Lesson 71 implements exact-match, accuracy, F1-score, BLEU-4, and ROUGE-L in phases/19-capstone-projects/71-classical-metrics/code/main.py. Each metric is coded from first principles using Python’s standard library and NumPy rather than external NLP toolkits.

How does the repository calculate BLEU-4 without external libraries?

The source code defines internal helpers _ngram_counts, _modified_precision, and _brevity_penalty to compute modified n-gram precision for orders 1 through 4, applies additive-one smoothing, and penalizes short translations. A standalone corpus_bleu(preds, refs) function aggregates the score across multiple predictions and reference sets.

Which safety metrics does ai-engineering-from-scratch use for prompt-injection and refusal evaluation?

For prompt-injection detection, the curriculum uses per-category precision, recall, and F1 based on true-positive, false-positive, true-negative, and false-negative counts. For refusal evaluation, it tracks under-refusal, over-refusal, overall accuracy, and the Expected Calibration Error (ECE) of the safety gate.

How does the metric dispatcher simplify the evaluation pipeline?

The score(metric_name, prediction, targets) dispatcher returns a float in [0, 1] for any supported metric. By routing all metric calls through this single interface, downstream evaluation runners—such as the end-to-end runner in phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py—remain agnostic to the specific metric being computed.
