How to Evaluate Image Inputs with the OpenAI Evals API: A Complete Guide
The OpenAI Cookbook provides a vision-eval harness that automates end-to-end evaluation of image-generation models by combining the Images API with LLM-based rubric grading to produce structured, reproducible scores.
Evaluating image inputs with the Evals API requires a structured pipeline that generates images, persists artifacts, and judges outputs against defined criteria. The openai/openai-cookbook repository ships a complete reference implementation in examples/evals/imagegen_evals/ that demonstrates how to evaluate DALL-E, GPT-Image, and other vision models programmatically.
Architecture of the Vision-Eval Harness
The evaluation system is organized into three distinct layers that separate test definition, execution, and grading concerns.
Layer 1: Test Definition
The foundation of any evaluation is the TestCase data class defined in examples/evals/imagegen_evals/vision_harness/types.py. This class encapsulates the prompt, evaluation criteria, and task type (image generation or editing) for a single evaluation scenario.
from vision_harness.types import TestCase
ui_case = TestCase(
id="ui_checkout_mockup",
task_type="image_generation",
prompt="Generate a high-fidelity mobile checkout screen with a Checkout button...",
criteria="The image must contain a Checkout button, Place Order CTA, and price display",
)
Layer 2: Runner and Storage
The vision_harness/runners.py module handles API calls to client.images.generate() or client.images.edit(), while vision_harness/storage.py manages deterministic file paths for raw image bytes. Together, these components create Artifact objects that normalize outputs for downstream processing.
Layer 3: Grader and Orchestration
The vision_harness/graders.py file implements LLMajRubricGrader, which uses a judge model (default gpt-5.2) to score images against JSON schemas. The vision_harness/evaluate.py module provides the main orchestration loop that wires cases, runs, and graders together.
Defining Evaluation Cases and Model Runs
Before executing an evaluation, you must define both the test cases and the model configurations to test against.
Creating TestCase Objects
Each TestCase in vision_harness/types.py requires a unique identifier, task type, prompt string, and evaluation criteria. The criteria field typically contains natural language instructions that the judge model will use to assess the generated image.
from vision_harness.types import TestCase, ModelRun
# Define what to test
ui_case = TestCase(
id="ui_checkout_mockup",
task_type="image_generation",
prompt="Generate a high-fidelity mobile checkout screen...",
criteria="Must include Checkout button, Place Order CTA, and price display",
)
# Configure how to test it
ui_run = ModelRun(
label="gpt-image-1.5-ui",
task_type="image_generation",
params={"model": "gpt-image-1.5", "n": 1, "size": "1024x1024"},
)
The ModelRun class bundles model-specific parameters including the model name, number of images to generate (n), and image dimensions (size).
Converting Images for Judge Model Consumption
When grading generated images, the harness must convert local image files into data URLs that can be sent to the OpenAI chat completions API. The image_to_data_url function in vision_harness/io.py handles this conversion by base64-encoding PNG or JPEG files.
# vision_harness/io.py
def image_to_data_url(path: Path) -> str:
mime = _MIME_BY_SUFFIX.get(path.suffix.lower(), "image/png")
b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
return f"data:{mime};base64,{b64}"
This helper is invoked internally by build_generation_judge_content to construct the multi-modal chat messages that include both the evaluation criteria and the generated image as a data URL.
Implementing LLM-Based Rubric Grading
The LLMajRubricGrader class in vision_harness/graders.py implements the core judging logic. It sends structured requests to a judge model along with a strict JSON schema that defines required evaluation fields.
Configuring the Grader
from vision_harness.graders import LLMajRubricGrader, build_generation_judge_content
from vision_harness.types import Score
ui_grader = LLMajRubricGrader(
key="ui_eval",
system_prompt="You are an expert UI/UX evaluator...",
content_builder=build_generation_judge_content,
judge_model="gpt-5.2",
json_schema_name="ui_mockup_eval",
json_schema={
"type": "object",
"properties": {
"verdict": {"type": "string"},
"instruction_following": {"type": "boolean"},
"layout_hierarchy": {"type": "number"},
"in_image_text_rendering": {"type": "boolean"},
"ui_affordance_rendering": {"type": "number"},
"reason": {"type": "string"},
},
"required": [
"verdict", "instruction_following", "layout_hierarchy",
"in_image_text_rendering", "ui_affordance_rendering", "reason"
],
"additionalProperties": False,
},
result_parser=lambda data, _: [
Score(key="instruction_following", value=bool(data["instruction_following"]), reason=""),
Score(key="layout_hierarchy", value=float(data["layout_hierarchy"]), reason=""),
Score(key="verdict", value=str(data["verdict"]), reason=data.get("reason", "").strip()),
],
)
The result_parser lambda converts the judge's JSON response into a list of Score objects that the harness can aggregate and report.
Executing the Full Evaluation Pipeline
The evaluate() function in vision_harness/evaluate.py orchestrates the complete workflow: running models, storing artifacts, invoking graders, and returning structured results.
from openai import OpenAI
from pathlib import Path
from vision_harness.evaluate import evaluate
from vision_harness.storage import OutputStore
client = OpenAI() # Reads OPENAI_API_KEY from environment
output_store = OutputStore(root=Path("./tmp_artifacts"))
results = evaluate(
cases=[ui_case],
model_runs=[ui_run],
graders=[ui_grader],
output_store=output_store,
)
# Access structured scores
for r in results:
print(r["scores"])
According to the openai-cookbook source code, the evaluate function automatically handles batching, error handling, and artifact persistence to the OutputStore directory.
Optional OCR Validation for Text-Heavy Images
For evaluations requiring text accuracy verification—such as flyer generation—the harness includes an additional post-processing step. The extract_text_from_flyer function sends the generated image to the OpenAI Responses endpoint to extract line-by-line text, then compares it against a required set of strings (REQUIRED_TEXT).
This demonstrates how to wire custom validation logic into the standard evaluation pipeline without modifying the core harness code.
Customizing and Extending the Workflow
The vision-eval harness supports several extension patterns for specialized use cases.
- Swap prompts and criteria – Edit the
UI_PROMPT,UI_CRITERIA, or create newTestCaseobjects with domain-specific evaluation rubrics. - Add new judges – Provide alternative
system_promptand JSON schema parameters toLLMajRubricGraderfor different evaluation dimensions (aesthetic quality, brand safety, etc.). - Test different image models – Change the
params["model"]field inModelRunto evaluatedall-e-3,gpt-image-1.5, or future vision models. - Run batch sweeps – Pass multiple
ModelRunobjects with distinct parameters toevaluate()to compare model variants in a single execution.
Summary
- The vision-eval harness in
examples/evals/imagegen_evals/provides a three-layer architecture for evaluating image inputs with the Evals API: test definition (types.py), execution (runners.py,storage.py), and grading (graders.py). TestCaseobjects encapsulate prompts and criteria, whileModelRunconfigurations specify which models and parameters to evaluate.- The
image_to_data_urlhelper invision_harness/io.pyconverts local images to base64 data URLs for multi-modal judge model consumption. LLMajRubricGradersends structured JSON schema requests to judge models (defaultgpt-5.2) and parses responses intoScoreobjects.- The
evaluate()function invision_harness/evaluate.pyorchestrates the complete pipeline from generation to structured scoring, with optional OCR validation for text accuracy.
Frequently Asked Questions
What file paths contain the core vision-eval harness implementation?
The core implementation resides in examples/evals/imagegen_evals/vision_harness/, with types.py defining data models, runners.py handling API calls, graders.py implementing LLM-based scoring, and evaluate.py providing the orchestration loop. The CLI entry point is located at examples/evals/imagegen_evals/generation_harness/run_imagegen_evals.py.
How does the harness convert generated images for judge model evaluation?
The harness uses the image_to_data_url function defined in vision_harness/io.py to read PNG or JPEG files and base64-encode them into data URLs. These URLs are embedded in chat messages sent to the judge model via the build_generation_judge_content helper, allowing the LLM to "see" the generated images.
Can I evaluate image editing tasks in addition to image generation?
Yes. The vision_harness/runners.py module supports both generation and editing task types. When task_type is set to image editing, the runner calls client.images.edit() instead of client.images.generate(), passing reference images and masks as specified in the TestCase configuration.
What judge model does the harness use by default, and can I change it?
The default judge model is gpt-5.2, specified in the LLMajRubricGrader initialization. You can override this by passing a different model identifier to the judge_model parameter when instantiating the grader, allowing you to use GPT-4o, o1, or other vision-capable models for evaluation.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →