# How to Evaluate Image Inputs with the OpenAI Evals API: A Complete Guide

> Learn to evaluate image inputs with the OpenAI Evals API. Get a complete guide to automating end-to-end model evaluation with structured, reproducible scores using LLM-based grading.

- Repository: [OpenAI/openai-cookbook](https://github.com/openai/openai-cookbook)
- Tags: how-to-guide
- Published: 2026-03-02

---

**The OpenAI Cookbook provides a vision-eval harness that automates end-to-end evaluation of image-generation models by combining the Images API with LLM-based rubric grading to produce structured, reproducible scores.**

Evaluating image inputs with the Evals API requires a structured pipeline that generates images, persists artifacts, and judges outputs against defined criteria. The `openai/openai-cookbook` repository ships a complete reference implementation in `examples/evals/imagegen_evals/` that demonstrates how to evaluate DALL-E, GPT-Image, and other image-generation models programmatically.

## Architecture of the Vision-Eval Harness

The evaluation system is organized into three distinct layers that separate test definition, execution, and grading concerns.

### Layer 1: Test Definition

The foundation of any evaluation is the `TestCase` data class defined in [`examples/evals/imagegen_evals/vision_harness/types.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/types.py). This class encapsulates the prompt, evaluation criteria, and task type (image generation or editing) for a single evaluation scenario.

```python
from vision_harness.types import TestCase

ui_case = TestCase(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen with a Checkout button...",
    criteria="The image must contain a Checkout button, Place Order CTA, and price display",
)
```
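To make the shape of a test case concrete, here is a minimal sketch of what such a data class could look like. `TestCaseSketch` is a hypothetical stand-in, not the cookbook's actual `TestCase`, and the `"image_editing"` task-type value is an assumption (the source only shows `"image_generation"`).

```python
from dataclasses import dataclass

# Assumed set of task types; only "image_generation" appears in the guide.
ALLOWED_TASK_TYPES = ("image_generation", "image_editing")


@dataclass(frozen=True)
class TestCaseSketch:
    """Hypothetical stand-in for vision_harness.types.TestCase."""

    id: str
    task_type: str
    prompt: str
    criteria: str

    def __post_init__(self) -> None:
        # Fail early on malformed cases instead of at grading time.
        if not self.id:
            raise ValueError("id must be non-empty")
        if self.task_type not in ALLOWED_TASK_TYPES:
            raise ValueError(f"unknown task_type: {self.task_type!r}")


case = TestCaseSketch(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen...",
    criteria="Must include Checkout button, Place Order CTA, and price display",
)
```

Freezing the data class keeps a case immutable once defined, which helps when the same case is reused across several model runs.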

### Layer 2: Runner and Storage

The [`vision_harness/runners.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/runners.py) module handles API calls to `client.images.generate()` or `client.images.edit()`, while [`vision_harness/storage.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/storage.py) manages deterministic file paths for raw image bytes. Together, these components produce `Artifact` objects that normalize outputs for downstream grading.

### Layer 3: Grader and Orchestration

The [`vision_harness/graders.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/graders.py) file implements `LLMajRubricGrader`, which uses a judge model (default `gpt-5.2`) to score images and return results that conform to a JSON schema. The [`vision_harness/evaluate.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/evaluate.py) module provides the main orchestration loop that wires cases, runs, and graders together.
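The orchestration idea can be sketched as a nested loop over cases and runs, with every grader applied to each produced artifact. This is a simplified illustration using plain dicts and callables; `evaluate_sketch`, `run_fn`, and the result keys are assumptions, not the signature of the real `evaluate()` function.

```python
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]
Run = Dict[str, Any]
Grader = Callable[[Case, Any], Dict[str, Any]]


def evaluate_sketch(cases: List[Case], model_runs: List[Run],
                    run_fn: Callable[[Case, Run], Any],
                    graders: List[Grader]) -> List[Dict[str, Any]]:
    results = []
    for case in cases:
        for run in model_runs:
            # Only pair a case with runs of the same task type.
            if run["task_type"] != case["task_type"]:
                continue
            artifact = run_fn(case, run)  # e.g. a path to the generated image
            scores: Dict[str, Any] = {}
            for grader in graders:
                scores.update(grader(case, artifact))
            results.append(
                {"case_id": case["id"], "run_label": run["label"], "scores": scores}
            )
    return results
```

Keeping the runner and graders as injected callables means the loop can be exercised with stubs, without calling any API.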

## Defining Evaluation Cases and Model Runs

Before executing an evaluation, you must define both the test cases and the model configurations to test against.

### Creating TestCase Objects

Each `TestCase` in [`vision_harness/types.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/types.py) requires a unique identifier, task type, prompt string, and evaluation criteria. The criteria field typically contains natural-language instructions that the judge model uses to assess the generated image.

```python
from vision_harness.types import TestCase, ModelRun

# Define what to test
ui_case = TestCase(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen...",
    criteria="Must include Checkout button, Place Order CTA, and price display",
)

# Configure how to test it
ui_run = ModelRun(
    label="gpt-image-1.5-ui",
    task_type="image_generation",
    params={"model": "gpt-image-1.5", "n": 1, "size": "1024x1024"},
)
```

The `ModelRun` class bundles model-specific parameters including the model name, number of images to generate (`n`), and image dimensions (`size`).
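As a rough illustration of how a run's `params` and a case's prompt might be combined into one request, the sketch below merges them into keyword arguments suitable for `client.images.generate(**kwargs)`. The helper name `build_generate_kwargs` is hypothetical, and the actual runner may assemble its request differently.

```python
from typing import Any, Dict


def build_generate_kwargs(prompt: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge a case prompt with a run's params."""
    if "model" not in params:
        raise ValueError("params must include a 'model' entry")
    kwargs = dict(params)  # copy so the run's params are not mutated
    kwargs["prompt"] = prompt
    return kwargs


run_params = {"model": "gpt-image-1.5", "n": 1, "size": "1024x1024"}
kwargs = build_generate_kwargs("Generate a high-fidelity mobile checkout screen...", run_params)
```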

## Converting Images for Judge Model Consumption

When grading generated images, the harness must convert local image files into data URLs that can be sent to the OpenAI chat completions API. The `image_to_data_url` function in [`vision_harness/io.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/io.py) handles this conversion by base64-encoding PNG or JPEG files.

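```python
# vision_harness/io.py
import base64
from pathlib import Path

# _MIME_BY_SUFFIX is a module-level mapping from file suffix to MIME type
# (its definition is not shown in this excerpt).
def image_to_data_url(path: Path) -> str:
    # Fall back to PNG when the suffix is not in the mapping
    mime = _MIME_BY_SUFFIX.get(path.suffix.lower(), "image/png")
    b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```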

This helper is invoked internally by `build_generation_judge_content` to construct the multi-modal chat messages that include both the evaluation criteria and the generated image as a data URL.
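A multi-modal message of this kind can be sketched as a list of content parts: one text part carrying the criteria and one image part carrying the data URL. The part layout below follows the Chat Completions `image_url` content format; `build_judge_content_sketch` and the two byte helpers are hypothetical names, and the real `build_generation_judge_content` may structure its message differently.

```python
import base64
from typing import Any, Dict, List


def bytes_to_data_url(data: bytes, mime: str = "image/png") -> str:
    # Same encoding idea as image_to_data_url, but from in-memory bytes.
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def data_url_to_bytes(url: str) -> bytes:
    # Everything after the first comma is the base64 payload.
    _, _, payload = url.partition(",")
    return base64.b64decode(payload)


def build_judge_content_sketch(criteria: str, data_url: str) -> List[Dict[str, Any]]:
    """Hypothetical builder for the judge's user-message content."""
    return [
        {"type": "text", "text": f"Evaluation criteria:\n{criteria}"},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]


png_magic = b"\x89PNG\r\n\x1a\n"  # first bytes of any PNG file
url = bytes_to_data_url(png_magic)
content = build_judge_content_sketch("Must include a Checkout button", url)
```

Decoding the URL back to bytes is a cheap sanity check that the image reaches the judge unmodified.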

## Implementing LLM-Based Rubric Grading

The `LLMajRubricGrader` class in [`vision_harness/graders.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/graders.py) implements the core judging logic. It sends a structured request to a judge model together with a strict JSON schema that defines the required evaluation fields.

### Configuring the Grader

```python
from vision_harness.graders import LLMajRubricGrader, build_generation_judge_content
from vision_harness.types import Score

ui_grader = LLMajRubricGrader(
    key="ui_eval",
    system_prompt="You are an expert UI/UX evaluator...",
    content_builder=build_generation_judge_content,
    judge_model="gpt-5.2",
    json_schema_name="ui_mockup_eval",
    json_schema={
        "type": "object",
        "properties": {
            "verdict": {"type": "string"},
            "instruction_following": {"type": "boolean"},
            "layout_hierarchy": {"type": "number"},
            "in_image_text_rendering": {"type": "boolean"},
            "ui_affordance_rendering": {"type": "number"},
            "reason": {"type": "string"},
        },
        "required": [
            "verdict", "instruction_following", "layout_hierarchy",
            "in_image_text_rendering", "ui_affordance_rendering", "reason"
        ],
        "additionalProperties": False,
    },
    result_parser=lambda data, _: [
        Score(key="instruction_following", value=bool(data["instruction_following"]), reason=""),
        Score(key="layout_hierarchy", value=float(data["layout_hierarchy"]), reason=""),
        Score(key="verdict", value=str(data["verdict"]), reason=data.get("reason", "").strip()),
    ],
)
```

The `result_parser` lambda converts the judge's JSON response into a list of `Score` objects that the harness can aggregate and report.
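The parsing step can also be written as a named function that checks for missing fields before building scores. The sketch below uses `ScoreSketch`, a hypothetical stand-in for the harness's `Score` class, and assumes the judge returns the raw JSON as a string.

```python
import json
from dataclasses import dataclass
from typing import List, Union

REQUIRED_FIELDS = ("verdict", "instruction_following", "layout_hierarchy", "reason")


@dataclass
class ScoreSketch:
    """Hypothetical stand-in for vision_harness.types.Score."""

    key: str
    value: Union[bool, float, str]
    reason: str = ""


def parse_judge_json(raw: str) -> List[ScoreSketch]:
    data = json.loads(raw)
    missing = [k for k in REQUIRED_FIELDS if k not in data]
    if missing:
        raise ValueError(f"judge response is missing fields: {missing}")
    return [
        ScoreSketch("instruction_following", bool(data["instruction_following"])),
        ScoreSketch("layout_hierarchy", float(data["layout_hierarchy"])),
        # Attach the judge's free-text rationale to the overall verdict.
        ScoreSketch("verdict", str(data["verdict"]), data["reason"].strip()),
    ]


scores = parse_judge_json(
    '{"verdict": "pass", "instruction_following": true, '
    '"layout_hierarchy": 4, "reason": " Clear CTA. "}'
)
```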

## Executing the Full Evaluation Pipeline

The `evaluate()` function in [`vision_harness/evaluate.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/evaluate.py) orchestrates the complete workflow: running models, storing artifacts, invoking graders, and returning structured results.

```python
from openai import OpenAI
from pathlib import Path
from vision_harness.evaluate import evaluate
from vision_harness.storage import OutputStore

client = OpenAI()  # Reads OPENAI_API_KEY from environment

output_store = OutputStore(root=Path("./tmp_artifacts"))
results = evaluate(
    cases=[ui_case],
    model_runs=[ui_run],
    graders=[ui_grader],
    output_store=output_store,
)

# Access structured scores
for r in results:
    print(r["scores"])
```

According to the openai-cookbook source code, the `evaluate` function automatically handles batching, errors, and artifact persistence to the `OutputStore` directory.
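Once results are available, a small aggregation step makes comparisons easier. The sketch below assumes each result's `"scores"` entry is a list of records with `key` and `value` fields; that shape is an assumption for illustration, so adapt the field access to what `evaluate()` actually returns.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average numeric scores per key; booleans count as 1.0 / 0.0 (pass rate)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for result in results:
        for score in result["scores"]:
            value = score["value"]
            # Skip free-text values such as the verdict string.
            if isinstance(value, (bool, int, float)):
                buckets[score["key"]].append(float(value))
    return {key: mean(values) for key, values in buckets.items()}


example_results = [
    {"scores": [{"key": "instruction_following", "value": True},
                {"key": "layout_hierarchy", "value": 4.0},
                {"key": "verdict", "value": "pass"}]},
    {"scores": [{"key": "instruction_following", "value": False},
                {"key": "layout_hierarchy", "value": 3.0},
                {"key": "verdict", "value": "fail"}]},
]
summary = summarize(example_results)
```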

## Optional OCR Validation for Text-Heavy Images

For evaluations that must verify text accuracy, such as flyer generation, the harness includes an additional post-processing step. The `extract_text_from_flyer` function sends the generated image to the OpenAI Responses API to extract the text line by line, then compares it against a required set of strings (`REQUIRED_TEXT`).

This demonstrates how to wire custom validation logic into the standard evaluation pipeline without modifying the core harness code.
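The comparison step can be sketched independently of the extraction call. The helper below takes already-extracted lines and reports which required strings are absent, ignoring case and whitespace differences. `missing_required_text` is a hypothetical name, the sample strings are invented for illustration, and the matching rule (normalized substring search) is an assumption rather than the cookbook's exact logic.

```python
import re
from typing import Iterable, List


def _normalize(text: str) -> str:
    # Collapse runs of whitespace and ignore case differences.
    return re.sub(r"\s+", " ", text).strip().casefold()


def missing_required_text(extracted_lines: Iterable[str],
                          required: Iterable[str]) -> List[str]:
    """Return the required strings that do not appear in the extracted text."""
    haystack = _normalize(" ".join(extracted_lines))
    return [item for item in required if _normalize(item) not in haystack]


extracted = ["GRAND  OPENING", "Saturday 10 AM"]  # illustrative OCR output
required_text = ["Grand Opening", "Free Entry"]   # illustrative REQUIRED_TEXT
missing = missing_required_text(extracted, required_text)
```

An empty result means every required string was found, which can be reported as a boolean score alongside the judge's rubric scores.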

## Customizing and Extending the Workflow

The vision-eval harness supports several extension patterns for specialized use cases.

- **Swap prompts and criteria** – Edit the `UI_PROMPT`, `UI_CRITERIA`, or create new `TestCase` objects with domain-specific evaluation rubrics.
- **Add new judges** – Provide alternative `system_prompt` and JSON schema parameters to `LLMajRubricGrader` for different evaluation dimensions (aesthetic quality, brand safety, etc.).
- **Test different image models** – Change the `params["model"]` field in `ModelRun` to evaluate `dall-e-3`, `gpt-image-1.5`, or other image-generation models.
- **Run batch sweeps** – Pass multiple `ModelRun` objects with distinct parameters to `evaluate()` to compare model variants in a single execution.
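A batch sweep can be generated rather than written by hand. The sketch below builds one run configuration per model and size combination as plain dicts mirroring the `ModelRun` fields shown earlier; `build_sweep` is a hypothetical helper, and every size value in a real sweep must be one that each listed model supports.

```python
from itertools import product
from typing import Any, Dict, List


def build_sweep(models: List[str], sizes: List[str],
                task_type: str = "image_generation") -> List[Dict[str, Any]]:
    """Hypothetical helper: one run config per (model, size) combination."""
    runs = []
    for model, size in product(models, sizes):
        runs.append({
            "label": f"{model}-{size}",  # unique label per combination
            "task_type": task_type,
            "params": {"model": model, "n": 1, "size": size},
        })
    return runs


sweep = build_sweep(["dall-e-3", "gpt-image-1.5"], ["1024x1024"])
```

Each dict can then be unpacked into a `ModelRun(**config)` call, assuming the constructor accepts these three fields as in the earlier example.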

## Summary

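- The vision-eval harness in `examples/evals/imagegen_evals/` provides a three-layer architecture for **evaluating image inputs with the Evals API**: test definition ([`types.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/types.py)), execution ([`runners.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/runners.py), [`storage.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/storage.py)), and grading and orchestration ([`graders.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/graders.py), [`evaluate.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/evaluate.py)).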
- `TestCase` objects encapsulate prompts and criteria, while `ModelRun` configurations specify which models and parameters to evaluate.
- The `image_to_data_url` helper in [`vision_harness/io.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/io.py) converts local images to base64 data URLs for multi-modal judge model consumption.
- `LLMajRubricGrader` sends structured JSON schema requests to judge models (default `gpt-5.2`) and parses responses into `Score` objects.
- The `evaluate()` function in [`vision_harness/evaluate.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/evaluate.py) orchestrates the complete pipeline from generation to structured scoring, with optional OCR validation for text accuracy.

## Frequently Asked Questions

### What file paths contain the core vision-eval harness implementation?

The core implementation resides in `examples/evals/imagegen_evals/vision_harness/`, with [`types.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/types.py) defining data models, [`runners.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/runners.py) handling API calls, [`graders.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/graders.py) implementing LLM-based scoring, and [`evaluate.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/evaluate.py) providing the orchestration loop. The CLI entry point is located at [`examples/evals/imagegen_evals/generation_harness/run_imagegen_evals.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/generation_harness/run_imagegen_evals.py).

### How does the harness convert generated images for judge model evaluation?

The harness uses the `image_to_data_url` function defined in [`vision_harness/io.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/io.py) to read PNG or JPEG files and base64-encode them into data URLs. The `build_generation_judge_content` helper embeds these URLs in the chat messages sent to the judge model, allowing the LLM to "see" the generated images.

### Can I evaluate image editing tasks in addition to image generation?

Yes. The [`vision_harness/runners.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/imagegen_evals/vision_harness/runners.py) module supports both generation and editing task types. When `task_type` is set to image editing, the runner calls `client.images.edit()` instead of `client.images.generate()`, passing reference images and masks as specified in the `TestCase` configuration.

### What judge model does the harness use by default, and can I change it?

The default judge model is `gpt-5.2`, specified in the `LLMajRubricGrader` initialization. You can override this by passing a different model identifier to the `judge_model` parameter when instantiating the grader, allowing you to use GPT-4o, o1, or other vision-capable models for evaluation.