How to Evaluate Image Inputs with the OpenAI Evals API: A Complete Guide

The OpenAI Cookbook provides a vision-eval harness that automates end-to-end evaluation of image-generation models by combining the Images API with LLM-based rubric grading to produce structured, reproducible scores.

Evaluating image inputs with the Evals API requires a structured pipeline that generates images, persists artifacts, and judges outputs against defined criteria. The openai/openai-cookbook repository ships a complete reference implementation in examples/evals/imagegen_evals/ that demonstrates how to evaluate DALL-E, GPT-Image, and other image-generation models programmatically.

Architecture of the Vision-Eval Harness

The evaluation system is organized into three distinct layers that separate test definition, execution, and grading concerns.

Layer 1: Test Definition

The foundation of any evaluation is the TestCase data class defined in examples/evals/imagegen_evals/vision_harness/types.py. This class encapsulates the prompt, evaluation criteria, and task type (image generation or editing) for a single evaluation scenario.

from vision_harness.types import TestCase

ui_case = TestCase(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen with a Checkout button...",
    criteria="The image must contain a Checkout button, Place Order CTA, and price display",
)
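For orientation, a data class with these four fields might look like the sketch below. The name SketchTestCase and the non-empty check are assumptions made for illustration; the actual TestCase in types.py may define more fields (for example, inputs needed by editing tasks) and different validation.

```python
from dataclasses import dataclass


# Hypothetical stand-in for vision_harness.types.TestCase. Field names follow
# the example above; the validation rule is an assumption.
@dataclass(frozen=True)
class SketchTestCase:
    id: str
    task_type: str
    prompt: str
    criteria: str

    def __post_init__(self) -> None:
        # Reject empty fields early so a malformed case fails before any API call.
        for name in ("id", "task_type", "prompt", "criteria"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be a non-empty string")


case = SketchTestCase(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen",
    criteria="Must include a Checkout button and a price display",
)
print(case.id, case.task_type)
```

Freezing the data class keeps a case from being mutated between the generation and grading steps, which makes results easier to trace back to the exact prompt and criteria that produced them.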

Layer 2: Runner and Storage

The vision_harness/runners.py module handles API calls to client.images.generate() or client.images.edit(), while vision_harness/storage.py manages deterministic file paths for raw image bytes. Together, these components create Artifact objects that normalize outputs for downstream processing.
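The idea of deterministic storage can be sketched with the standard library alone. Everything named here (SketchArtifact, artifact_path, save_artifact, and the directory layout) is hypothetical and only illustrates the pattern; the real storage.py and Artifact type may be organized differently.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchArtifact:
    # Hypothetical record tying a saved image back to its case and run.
    case_id: str
    run_label: str
    index: int
    path: Path


def artifact_path(root: Path, case_id: str, run_label: str, index: int) -> Path:
    # The same inputs always map to the same location, so reruns overwrite in place.
    return root / case_id / run_label / f"image_{index:02d}.png"


def save_artifact(root: Path, case_id: str, run_label: str, index: int, data: bytes) -> SketchArtifact:
    path = artifact_path(root, case_id, run_label, index)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return SketchArtifact(case_id, run_label, index, path)


root = Path(tempfile.mkdtemp())
# Placeholder bytes stand in for the image returned by the API.
artifact = save_artifact(root, "ui_checkout_mockup", "gpt-image-1.5-ui", 0, b"fake-png-bytes")
print(artifact.path.relative_to(root))
```

Deriving the path purely from the case id, run label, and image index means a grader can locate an image without consulting any extra index file.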

Layer 3: Grader and Orchestration

The vision_harness/graders.py file implements LLMajRubricGrader, which uses a judge model (default gpt-5.2) to score images and return results that conform to a JSON schema. The vision_harness/evaluate.py module provides the main orchestration loop that wires cases, runs, and graders together.

Defining Evaluation Cases and Model Runs

Before executing an evaluation, you must define both the test cases and the model configurations to test against.

Creating TestCase Objects

Each TestCase in vision_harness/types.py requires a unique identifier, task type, prompt string, and evaluation criteria. The criteria field typically contains natural language instructions that the judge model will use to assess the generated image.

from vision_harness.types import TestCase, ModelRun

# Define what to test
ui_case = TestCase(
    id="ui_checkout_mockup",
    task_type="image_generation",
    prompt="Generate a high-fidelity mobile checkout screen...",
    criteria="Must include Checkout button, Place Order CTA, and price display",
)

# Configure how to test it
ui_run = ModelRun(
    label="gpt-image-1.5-ui",
    task_type="image_generation",
    params={"model": "gpt-image-1.5", "n": 1, "size": "1024x1024"},
)

The ModelRun class bundles model-specific parameters including the model name, number of images to generate (n), and image dimensions (size).
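One plausible way a runner could turn these parameters into an Images API request is to merge them with the case prompt and unpack the result as keyword arguments. The helper build_generate_kwargs is hypothetical and is not taken from runners.py; the sketch stops short of calling the API so it can run without credentials.

```python
from typing import Any


def build_generate_kwargs(prompt: str, params: dict[str, Any]) -> dict[str, Any]:
    # Copy first so the run's params are not mutated, then attach the case prompt.
    kwargs = dict(params)
    kwargs["prompt"] = prompt
    return kwargs


params = {"model": "gpt-image-1.5", "n": 1, "size": "1024x1024"}
kwargs = build_generate_kwargs("Generate a high-fidelity mobile checkout screen", params)
print(sorted(kwargs))

# With a real client, the request would then be: client.images.generate(**kwargs)
```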

Converting Images for Judge Model Consumption

When grading generated images, the harness must convert local image files into data URLs that can be sent to the OpenAI chat completions API. The image_to_data_url function in vision_harness/io.py handles this conversion by base64-encoding PNG or JPEG files.


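# vision_harness/io.py (excerpt)
import base64
from pathlib import Path

# Assumed mapping: the excerpt does not show the actual contents of this table.
_MIME_BY_SUFFIX = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}

def image_to_data_url(path: Path) -> str:
    mime = _MIME_BY_SUFFIX.get(path.suffix.lower(), "image/png")  # unknown suffixes fall back to PNG
    b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{b64}"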

This helper is invoked internally by build_generation_judge_content to construct the multi-modal chat messages that include both the evaluation criteria and the generated image as a data URL.
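A content builder of this kind might assemble its message parts roughly as follows. This is a sketch, not the cookbook's build_generation_judge_content: the function names are hypothetical, and the part shapes assume the chat completions image_url format, so a builder that targets a different endpoint would use different keys.

```python
import base64
import tempfile
from pathlib import Path


def to_data_url(path: Path, mime: str = "image/png") -> str:
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def sketch_judge_content(prompt: str, criteria: str, image_path: Path) -> list[dict]:
    # The text part carries the rubric; the image part carries the generated output.
    return [
        {"type": "text", "text": f"Prompt:\n{prompt}\n\nCriteria:\n{criteria}"},
        {"type": "image_url", "image_url": {"url": to_data_url(image_path)}},
    ]


image_path = Path(tempfile.mkdtemp()) / "sample.png"
image_path.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a valid image
content = sketch_judge_content("Checkout screen", "Must show a price", image_path)
print(content[0]["text"])
print(content[1]["image_url"]["url"][:22])
```

Keeping the original prompt and the criteria in the same text part gives the judge both what was asked for and how to score it, alongside the image itself.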

Implementing LLM-Based Rubric Grading

The LLMajRubricGrader class in vision_harness/graders.py implements the core judging logic. It sends structured requests to a judge model along with a strict JSON schema that defines required evaluation fields.

Configuring the Grader

from vision_harness.graders import LLMajRubricGrader, build_generation_judge_content
from vision_harness.types import Score

ui_grader = LLMajRubricGrader(
    key="ui_eval",
    system_prompt="You are an expert UI/UX evaluator...",
    content_builder=build_generation_judge_content,
    judge_model="gpt-5.2",
    json_schema_name="ui_mockup_eval",
    json_schema={
        "type": "object",
        "properties": {
            "verdict": {"type": "string"},
            "instruction_following": {"type": "boolean"},
            "layout_hierarchy": {"type": "number"},
            "in_image_text_rendering": {"type": "boolean"},
            "ui_affordance_rendering": {"type": "number"},
            "reason": {"type": "string"},
        },
        "required": [
            "verdict", "instruction_following", "layout_hierarchy",
            "in_image_text_rendering", "ui_affordance_rendering", "reason"
        ],
        "additionalProperties": False,
    },
    result_parser=lambda data, _: [
        Score(key="instruction_following", value=bool(data["instruction_following"]), reason=""),
        Score(key="layout_hierarchy", value=float(data["layout_hierarchy"]), reason=""),
        Score(key="verdict", value=str(data["verdict"]), reason=data.get("reason", "").strip()),
    ],
)

The result_parser lambda converts the judge's JSON response into a list of Score objects that the harness can aggregate and report.
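The parser in the example above surfaces three of the six schema fields. A variant that checks for the required keys and reports every boolean and numeric field could look like the sketch below; SketchScore and parse_judge_response are hypothetical stand-ins, and the sample judge response is invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Union

REQUIRED_KEYS = (
    "verdict", "instruction_following", "layout_hierarchy",
    "in_image_text_rendering", "ui_affordance_rendering", "reason",
)


@dataclass
class SketchScore:
    # Hypothetical stand-in for vision_harness.types.Score.
    key: str
    value: Union[bool, float, str]
    reason: str = ""


def parse_judge_response(raw: str) -> list[SketchScore]:
    data = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"judge response missing keys: {missing}")
    reason = str(data["reason"]).strip()
    return [
        SketchScore("instruction_following", bool(data["instruction_following"])),
        SketchScore("layout_hierarchy", float(data["layout_hierarchy"])),
        SketchScore("in_image_text_rendering", bool(data["in_image_text_rendering"])),
        SketchScore("ui_affordance_rendering", float(data["ui_affordance_rendering"])),
        SketchScore("verdict", str(data["verdict"]), reason),
    ]


# Invented sample response, shaped like the schema above.
raw = json.dumps({
    "verdict": "PASS", "instruction_following": True, "layout_hierarchy": 4,
    "in_image_text_rendering": True, "ui_affordance_rendering": 3.5,
    "reason": " Price is visible; CTA is clear. ",
})
scores = parse_judge_response(raw)
for score in scores:
    print(score.key, score.value, score.reason)
```

Attaching the free-text reason only to the verdict score, as the original lambda does, avoids repeating the same explanation on every row of a report.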

Executing the Full Evaluation Pipeline

The evaluate() function in vision_harness/evaluate.py orchestrates the complete workflow: running models, storing artifacts, invoking graders, and returning structured results.

from openai import OpenAI
from pathlib import Path
from vision_harness.evaluate import evaluate
from vision_harness.storage import OutputStore

client = OpenAI()  # Reads OPENAI_API_KEY from environment

output_store = OutputStore(root=Path("./tmp_artifacts"))
results = evaluate(
    cases=[ui_case],
    model_runs=[ui_run],
    graders=[ui_grader],
    output_store=output_store,
)

# Access structured scores
for r in results:
    print(r["scores"])

According to the openai-cookbook source code, the evaluate function automatically handles batching, error handling, and artifact persistence to the OutputStore directory.
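The control flow of such an orchestration loop can be illustrated with stubs in place of the image API and the judge model. This is a simplified sketch under stated assumptions, not the cookbook's evaluate(): the function sketch_evaluate, the dict-based cases and runs, the "image_editing" task type string, and the result keys other than "scores" are all hypothetical.

```python
from typing import Callable


def sketch_evaluate(
    cases: list[dict],
    model_runs: list[dict],
    graders: list[Callable[[dict, str], list[dict]]],
    run_fn: Callable[[dict, dict], list[str]],
) -> list[dict]:
    results = []
    for case in cases:
        for run in model_runs:
            # Only pair a case with runs of the same task type.
            if run["task_type"] != case["task_type"]:
                continue
            try:
                artifact_paths = run_fn(case, run)
            except Exception as exc:
                # Record the failure and continue with the remaining pairs.
                results.append({"case_id": case["id"], "run_label": run["label"],
                                "artifact": None, "scores": [], "error": str(exc)})
                continue
            for artifact_path in artifact_paths:
                scores = [s for grader in graders for s in grader(case, artifact_path)]
                results.append({"case_id": case["id"], "run_label": run["label"],
                                "artifact": artifact_path, "scores": scores, "error": None})
    return results


def fake_runner(case: dict, run: dict) -> list[str]:
    # Stand-in for the image API call; returns one path per requested image.
    return [f"{case['id']}/{run['label']}/image_{i:02d}.png" for i in range(run["params"]["n"])]


def fake_grader(case: dict, artifact_path: str) -> list[dict]:
    # Stand-in for the judge model; always passes.
    return [{"key": "verdict", "value": "PASS", "reason": "stub"}]


results = sketch_evaluate(
    cases=[{"id": "ui_checkout_mockup", "task_type": "image_generation"}],
    model_runs=[
        {"label": "gen-run", "task_type": "image_generation", "params": {"n": 2}},
        {"label": "edit-run", "task_type": "image_editing", "params": {"n": 1}},
    ],
    graders=[fake_grader],
    run_fn=fake_runner,
)
for r in results:
    print(r["run_label"], r["artifact"], r["scores"])
```

Producing one result row per generated image, rather than per run, keeps scores comparable when n is greater than 1.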

Optional OCR Validation for Text-Heavy Images

For evaluations that require text accuracy verification, such as flyer generation, the harness includes an additional post-processing step. The extract_text_from_flyer function sends the generated image to the OpenAI Responses endpoint to extract line-by-line text, then compares it against a required set of strings (REQUIRED_TEXT).

This demonstrates how to wire custom validation logic into the standard evaluation pipeline without modifying the core harness code.
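The comparison step can be sketched independently of the OCR call. The normalization rule, the function names, and the sample strings below are assumptions for illustration; the cookbook's own REQUIRED_TEXT values and matching logic are not reproduced here.

```python
import re


def normalize(text: str) -> str:
    # Collapse whitespace and ignore case so layout differences do not cause misses.
    return re.sub(r"\s+", " ", text).strip().casefold()


def missing_required_text(extracted_lines: list[str], required: list[str]) -> list[str]:
    haystack = normalize(" ".join(extracted_lines))
    return [item for item in required if normalize(item) not in haystack]


# Hypothetical required strings and OCR output (note the misspelled last line).
REQUIRED_TEXT_EXAMPLE = ["Grand Opening", "Saturday 10 AM", "Free Coffee"]
extracted = ["GRAND  OPENING", "Saturday 10 AM", "Free Cofee"]

missing = missing_required_text(extracted, REQUIRED_TEXT_EXAMPLE)
print(missing)  # prints ['Free Coffee']
```

Returning the list of missing strings, rather than a single boolean, makes it clear which piece of text the image model misspelled or dropped.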

Customizing and Extending the Workflow

The vision-eval harness supports several extension patterns for specialized use cases.

  • Swap prompts and criteria – Edit the UI_PROMPT, UI_CRITERIA, or create new TestCase objects with domain-specific evaluation rubrics.
  • Add new judges – Provide alternative system_prompt and JSON schema parameters to LLMajRubricGrader for different evaluation dimensions (aesthetic quality, brand safety, etc.).
  • Test different image models – Change the params["model"] field in ModelRun to evaluate dall-e-3, gpt-image-1.5, or other image-generation models.
  • Run batch sweeps – Pass multiple ModelRun objects with distinct parameters to evaluate() to compare model variants in a single execution.
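A sweep of this kind can be generated programmatically. The sketch below builds plain dicts that mirror the ModelRun fields shown earlier; build_sweep is a hypothetical helper, and the model and size lists are illustrative, since not every model accepts every size.

```python
from itertools import product


def build_sweep(models: list[str], sizes: list[str], task_type: str = "image_generation") -> list[dict]:
    runs = []
    for model, size in product(models, sizes):
        runs.append({
            # A unique label per combination keeps results distinguishable.
            "label": f"{model}-{size}",
            "task_type": task_type,
            "params": {"model": model, "n": 1, "size": size},
        })
    return runs


sweep = build_sweep(["gpt-image-1.5", "dall-e-3"], ["1024x1024"])
for run in sweep:
    print(run["label"], run["params"])
```

Each dict could then be converted to a ModelRun and the full list passed to evaluate() as model_runs.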

Summary

  • The vision-eval harness in examples/evals/imagegen_evals/ provides a three-layer architecture for evaluating image inputs with the Evals API: test definition (types.py), execution (runners.py, storage.py), and grading and orchestration (graders.py, evaluate.py).
  • TestCase objects encapsulate prompts and criteria, while ModelRun configurations specify which models and parameters to evaluate.
  • The image_to_data_url helper in vision_harness/io.py converts local images to base64 data URLs for multi-modal judge model consumption.
  • LLMajRubricGrader sends structured JSON schema requests to judge models (default gpt-5.2) and parses responses into Score objects.
  • The evaluate() function in vision_harness/evaluate.py orchestrates the complete pipeline from generation to structured scoring, with optional OCR validation for text accuracy.

Frequently Asked Questions

What file paths contain the core vision-eval harness implementation?

The core implementation resides in examples/evals/imagegen_evals/vision_harness/, with types.py defining data models, runners.py handling API calls, graders.py implementing LLM-based scoring, and evaluate.py providing the orchestration loop. The CLI entry point is located at examples/evals/imagegen_evals/generation_harness/run_imagegen_evals.py.

How does the harness convert generated images for judge model evaluation?

The harness uses the image_to_data_url function defined in vision_harness/io.py to read PNG or JPEG files and base64-encode them into data URLs. These URLs are embedded in chat messages sent to the judge model via the build_generation_judge_content helper, allowing the LLM to "see" the generated images.

Can I evaluate image editing tasks in addition to image generation?

Yes. The vision_harness/runners.py module supports both generation and editing task types. When task_type is set to image editing, the runner calls client.images.edit() instead of client.images.generate(), passing reference images and masks as specified in the TestCase configuration.

What judge model does the harness use by default, and can I change it?

The default judge model is gpt-5.2, specified in the LLMajRubricGrader initialization. You can override this by passing a different model identifier to the judge_model parameter when instantiating the grader, allowing you to use GPT-4o, o1, or other vision-capable models for evaluation.
