# Evaluation Metrics for AI Models in ai-engineering-from-scratch: From BLEU to Calibration Error

> Explore AI model evaluation metrics in ai-engineering-from-scratch. Discover BLEU, ROUGE, calibration error, and more built from first principles.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: tutorial
- Published: 2026-06-06

---

**The ai-engineering-from-scratch repository implements a full-stack evaluation suite for AI models. It covers classical metrics (exact-match, F1, BLEU-4, ROUGE-L), safety-gate metrics (refusal rates, prompt-injection detection), and calibration metrics (ECE), all built from first principles and orchestrated through a unified metric dispatcher.**

The `ai-engineering-from-scratch` curriculum treats evaluation as a first-class engineering concern rather than an afterthought. Each metric is implemented from scratch using only the Python standard library and NumPy, which exposes the mathematical definitions and algorithmic details that library black boxes usually hide.

## Core Classical Metrics Implemented from First Principles

The foundation of the evaluation suite lives in Capstone Lesson 71. In [`phases/19-capstone-projects/71-classical-metrics/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/71-classical-metrics/code/main.py), the repository defines exact-match, accuracy, F1-score, BLEU-4, and ROUGE-L without relying on external NLP libraries.

### Exact-Match and Accuracy

**Exact-match** measures binary string equality between a prediction and any reference target. The implementation performs a simple case-sensitive comparison after whitespace stripping: `prediction.strip() == target.strip()`. **Accuracy** is provided as a direct alias of exact-match for single-label classification tasks, reusing the same core logic.
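
The comparison described above can be sketched in a few lines. This is a minimal illustration rather than the repository's code: the function name `exact_match` and its `(prediction, targets)` signature are assumptions chosen to match the description.

```python
def exact_match(prediction: str, targets: list[str]) -> float:
    """Return 1.0 if the stripped prediction equals any stripped target, else 0.0."""
    pred = prediction.strip()
    # Case-sensitive comparison; only surrounding whitespace is ignored.
    return 1.0 if any(pred == target.strip() for target in targets) else 0.0


# Accuracy as a plain alias for single-label tasks (assumed naming).
accuracy = exact_match

print(exact_match("  Paris ", ["Paris", "paris"]))  # 1.0
print(exact_match("paris", ["Paris"]))              # 0.0
```

Because the comparison is case-sensitive, any normalization such as lowercasing has to happen before the metric is called.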

### F1-Score

**F1-score** computes the harmonic mean of token-level precision and recall. The implementation tokenizes strings with the `\w+` regular-expression pattern and counts token overlaps via Python’s `collections.Counter`. It explicitly guards against empty-prediction and empty-target edge cases to avoid division-by-zero errors.
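
A token-level F1 along these lines might look like the sketch below. The names `tokenize` and `token_f1` are hypothetical, and the choice to return 1.0 when both strings are empty (and 0.0 when only one is) is an assumption; the repository may handle those edge cases differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # `\w+` keeps runs of word characters and drops punctuation.
    return re.findall(r"\w+", text)


def token_f1(prediction: str, target: str) -> float:
    pred_tokens = tokenize(prediction)
    target_tokens = tokenize(target)
    if not pred_tokens or not target_tokens:
        # Assumed convention: two empty strings match, one empty string does not.
        return float(pred_tokens == target_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the cat sat", "the cat sat down"), 3))  # 0.857
```

In the example, all three predicted tokens appear in the target (precision 1.0) while the target has four tokens (recall 0.75), giving F1 = 6/7.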

### BLEU-4

**BLEU-4** evaluates corpus-level modified n-gram precision across n-gram orders 1 through 4, augmented by a brevity penalty and additive-one smoothing. The source code defines helper functions `_ngram_counts`, `_modified_precision`, and `_brevity_penalty`. Smoothing adds 1 to both the numerator and denominator for each n-gram order, preventing zero scores when higher-order n-grams are missing. A dedicated `corpus_bleu(preds, refs)` function aggregates scores across multiple sentence pairs.
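
To make the pieces concrete, here is a compact sketch of corpus-level BLEU-4 with add-one smoothing. It is written independently of the repository: the helper names are loosely modeled on the ones mentioned above, while the signatures, whitespace tokenization, and the single-reference-per-prediction assumption are illustrative choices.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    # Count every contiguous n-gram; an empty Counter if the text is too short.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(pred_tokens, ref_tokens, n):
    """Return (clipped match count, total predicted n-grams) for order n."""
    pred = ngram_counts(pred_tokens, n)
    ref = ngram_counts(ref_tokens, n)
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched, sum(pred.values())


def brevity_penalty(pred_len, ref_len):
    if pred_len == 0:
        return 0.0
    return 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)


def corpus_bleu4(preds, refs):
    """Corpus BLEU-4, assuming exactly one reference string per prediction."""
    matched, total = [0] * 4, [0] * 4
    pred_len = ref_len = 0
    for pred, ref in zip(preds, refs):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, 5):
            m, t = clipped_matches(p, r, n)
            matched[n - 1] += m
            total[n - 1] += t
    # Add-one smoothing: +1 on numerator and denominator of every order.
    log_precisions = [math.log((m + 1) / (t + 1)) for m, t in zip(matched, total)]
    return brevity_penalty(pred_len, ref_len) * math.exp(sum(log_precisions) / 4)
```

Counts are pooled across the whole corpus before the precisions are computed, which is what distinguishes corpus-level BLEU from an average of sentence-level scores.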

### ROUGE-L

**ROUGE-L** measures similarity through Longest Common Subsequence (LCS). The implementation builds a dynamic-programming table as a NumPy array to compute LCS length, then derives precision, recall, and an F-beta score with β = 1. This approach captures sentence-level structure without requiring exact contiguous n-gram matches.
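
The LCS recurrence and the resulting F-score can be sketched as follows. To keep the example dependency-free, this version stores the dynamic-programming table in nested Python lists instead of a NumPy array; the function names and whitespace tokenization are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the subsequence ending at the previous pair.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the best result seen so far.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, target, beta=1.0):
    pred, ref = prediction.split(), target.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


print(round(rouge_l("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```

Here the LCS is "the cat on the mat" (five tokens out of six on each side), so precision and recall are both 5/6 even though one word in the middle differs.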

## Unified Metric Dispatcher

Rather than hard-coding metric calls throughout the pipeline, the repository exposes a single **metric dispatcher**, `score(metric_name, prediction, targets)`, which returns a normalized floating-point value in the range `[0, 1]`. Downstream runner lessons, such as the end-to-end evaluation runner in [`phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py), invoke this dispatcher to remain metric-agnostic.
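
One common way to build such a dispatcher is a name-to-function registry. The sketch below is an assumption about the general shape, not the repository's implementation: the `register` decorator, the `_REGISTRY` dict, the clamping step, and the `ValueError` on unknown names are all hypothetical.

```python
from typing import Callable

MetricFn = Callable[[str, list[str]], float]
_REGISTRY: dict[str, MetricFn] = {}


def register(name: str):
    """Decorator that files a metric function under a string name."""
    def decorator(fn: MetricFn) -> MetricFn:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("exact_match")
def _exact_match(prediction: str, targets: list[str]) -> float:
    return float(any(prediction.strip() == t.strip() for t in targets))


@register("accuracy")
def _accuracy(prediction: str, targets: list[str]) -> float:
    return _exact_match(prediction, targets)


def score(metric_name: str, prediction: str, targets: list[str]) -> float:
    try:
        metric = _REGISTRY[metric_name]
    except KeyError:
        raise ValueError(f"unknown metric: {metric_name!r}") from None
    # Clamp so callers can rely on a value in [0, 1].
    return min(1.0, max(0.0, float(metric(prediction, targets))))


print(score("exact_match", "yes", ["no", "yes "]))  # 1.0
```

With this layout, adding a metric means registering one more function; runner code that calls `score` does not change.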

## Safety and Calibration Metrics

Beyond text-generation quality, the curriculum evaluates model behavior, robustness, and confidence calibration through specialized capstone lessons.

### Expected Calibration Error (ECE)

The **Expected Calibration Error (ECE)** quantifies the gap between a model’s stated confidence and its empirical accuracy across confidence bins. Defined in [`phases/19-capstone-projects/73-perplexity-calibration/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/73-perplexity-calibration/docs/en.md), this metric helps learners diagnose over-confident or under-confident predictions. Reliability-diagram data is generated by binning predictions and comparing average confidence against observed accuracy.
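
As a rough illustration, ECE with equal-width bins is the weighted average of |average confidence - accuracy| over the non-empty bins, with each bin weighted by its share of the samples. The function name, the default of 10 bins, and the input format below are assumptions rather than details taken from the lesson.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins on [0, 1].

    confidences: predicted probabilities of the chosen answer.
    correct: parallel booleans (or 0/1) marking whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(is_correct)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four answers at 90% confidence, three of them right: over-confident by 0.15.
print(round(expected_calibration_error([0.9] * 4, [1, 1, 1, 0]), 3))  # 0.15
```

The per-bin `(avg_conf, accuracy)` pairs computed inside the loop are the same values a reliability diagram plots.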

### Prompt-Injection Detection Metrics

The prompt-injection detector lesson, documented in [`phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md), uses **per-category precision, recall, and F1** to evaluate safety-gate performance. The implementation tracks true-positive, false-positive, true-negative, and false-negative counts for each taxonomy category, allowing granular analysis of which attack vectors the detector catches or misses.
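
A per-category confusion-matrix tally can be sketched as below. The record format `(category, predicted, actual)` and the category names in the usage example are hypothetical; the lesson's actual taxonomy and data layout may differ.

```python
from collections import defaultdict


def per_category_metrics(records):
    """records: iterable of (category, predicted_injection, is_injection)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for category, predicted, actual in records:
        # "t"/"f" says whether the detector was right; "p"/"n" is what it said.
        key = ("t" if predicted == actual else "f") + ("p" if predicted else "n")
        counts[category][key] += 1

    report = {}
    for category, c in counts.items():
        flagged = c["tp"] + c["fp"]
        attacks = c["tp"] + c["fn"]
        precision = c["tp"] / flagged if flagged else 0.0
        recall = c["tp"] / attacks if attacks else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[category] = {"precision": precision, "recall": recall, "f1": f1, **c}
    return report


# Hypothetical categories, for illustration only.
records = [
    ("role_hijack", True, True),
    ("role_hijack", False, True),   # missed attack -> false negative
    ("role_hijack", True, False),   # benign prompt flagged -> false positive
    ("data_exfil", True, True),
]
for name, stats in per_category_metrics(records).items():
    print(name, stats["precision"], stats["recall"], stats["f1"])
```

Reporting the metrics per category rather than pooled makes it visible when a detector does well on one attack family and poorly on another.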

### Refusal Evaluation Metrics

The refusal-evaluation framework, outlined in [`phases/19-capstone-projects/84-refusal-evaluation/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/84-refusal-evaluation/docs/en.md), introduces four key behavioral metrics:

- **Under-refusal**: The rate at which the model incorrectly answers harmful prompts that should have been refused.
- **Over-refusal**: The rate at which the model refuses benign prompts that should have been answered.
- **Overall accuracy**: The proportion of prompts (harmful and benign) that receive the correct gated response.
- **Calibration (ECE)**: The calibration error of the safety gate’s confidence scores.
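
The first three rates in this list reduce to simple counting over paired labels. The sketch below assumes two parallel lists of booleans, `should_refuse` (ground truth) and `did_refuse` (model behavior); these names and the returned dictionary keys are hypothetical. ECE is left out because it additionally needs the gate's confidence scores.

```python
def refusal_metrics(should_refuse, did_refuse):
    """Compute under-refusal, over-refusal, and accuracy from parallel booleans."""
    pairs = [(bool(s), bool(d)) for s, d in zip(should_refuse, did_refuse)]
    harmful = [d for s, d in pairs if s]       # prompts that should be refused
    benign = [d for s, d in pairs if not s]    # prompts that should be answered

    # Under-refusal: harmful prompts that were answered anyway.
    under = sum(not d for d in harmful) / len(harmful) if harmful else 0.0
    # Over-refusal: benign prompts that were refused.
    over = sum(d for d in benign) / len(benign) if benign else 0.0
    # Accuracy: the gate decision matched the label.
    accuracy = sum(s == d for s, d in pairs) / len(pairs) if pairs else 0.0
    return {"under_refusal": under, "over_refusal": over, "accuracy": accuracy}


result = refusal_metrics(
    should_refuse=[True, True, False, False, False],
    did_refuse=[True, False, True, False, False],
)
print(result["under_refusal"], result["accuracy"])  # 0.5 0.6
```

Under-refusal and over-refusal use different denominators (harmful and benign prompts respectively), so a gate can look accurate overall while still failing badly on the smaller group.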

## Practical Code Examples

The following snippets demonstrate how to invoke the metrics layer directly from the repository source.

### Computing a Single Metric

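```python
import sys

# The lesson directories contain hyphens, so they cannot be imported as a
# dotted package path. One option (assuming the script runs from the
# repository root) is to put the lesson's code directory on sys.path.
sys.path.insert(0, "phases/19-capstone-projects/71-classical-metrics/code")
from main import score

prediction = "the cat sat on the mat"
targets = ["the cat sat on the mat", "a cat sat on the mat"]
bleu = score("bleu_4", prediction, targets)
print(f"BLEU-4: {bleu:.3f}")
```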

### Corpus-Level BLEU

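```python
import sys

# Hyphenated lesson directories are not importable as packages; this assumes
# the script runs from the repository root.
sys.path.insert(0, "phases/19-capstone-projects/71-classical-metrics/code")
from main import corpus_bleu

preds = ["the cat sat on the mat", "the runner won the race"]
refs = ["the cat sat on the mat", "the runner crossed the finish line first"]
print("Corpus BLEU-4:", corpus_bleu(preds, refs))
```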

### Evaluating Refusal Behavior

```python
# Assumes the refusal-evaluation runner has been imported as `refusal`, and
# that `prompts` and `model_outputs` are parallel lists prepared by the caller.
# The `evaluate` call and the result keys are illustrative names.
result = refusal.evaluate(prompts, model_outputs)
print("Under-refusal:", result["under_refusal"])
print("Over-refusal :", result["over_refusal"])
print("Accuracy     :", result["accuracy"])
print("ECE          :", result["ece"])
```

## Summary

- The **classical metrics** in [`phases/19-capstone-projects/71-classical-metrics/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/71-classical-metrics/code/main.py) provide exact-match, accuracy, F1, BLEU-4, and ROUGE-L implementations built from scratch with stdlib and NumPy.
- A unified `score(metric_name, prediction, targets)` **dispatcher** decouples metric definitions from downstream evaluation runners.
- **Safety metrics** include per-category precision/recall/F1 for prompt-injection detection and under-refusal/over-refusal rates for safety-gate evaluation.
- **Calibration** is measured via Expected Calibration Error (ECE) across confidence bins.
- Together, these components form a full-stack, metric-agnostic evaluation pipeline for language-model-based AI systems.

## Frequently Asked Questions

### What evaluation metrics for AI models are implemented in the classical metrics lesson?

Capstone Lesson 71 implements exact-match, accuracy, F1-score, BLEU-4, and ROUGE-L in [`phases/19-capstone-projects/71-classical-metrics/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/71-classical-metrics/code/main.py). Each metric is coded from first principles using Python’s standard library and NumPy rather than external NLP toolkits.

### How does the repository calculate BLEU-4 without external libraries?

The source code defines internal helpers `_ngram_counts`, `_modified_precision`, and `_brevity_penalty` to compute modified n-gram precision for orders 1 through 4, applies additive-one smoothing, and penalizes short translations. A standalone `corpus_bleu(preds, refs)` function aggregates the score across multiple predictions and reference sets.

### Which safety metrics does ai-engineering-from-scratch use for prompt-injection and refusal evaluation?

For prompt-injection detection, the curriculum uses per-category precision, recall, and F1 based on true-positive, false-positive, true-negative, and false-negative counts. For refusal evaluation, it tracks under-refusal, over-refusal, overall accuracy, and the Expected Calibration Error (ECE) of the safety gate.

### How does the metric dispatcher simplify the evaluation pipeline?

The `score(metric_name, prediction, targets)` dispatcher returns a float in `[0, 1]` for any supported metric. Because all metric calls are routed through this single interface, downstream evaluation runners, such as the end-to-end runner in [`phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py), remain agnostic to the specific metric being computed.