# How to Fine-tune GPT-4o for Visual Question Answering: A Complete Guide

> Master fine-tuning GPT-4o for visual question answering with this comprehensive guide. Enhance your model's visual reasoning on specialized datasets like OCR-VQA for improved accuracy.

- Repository: [OpenAI/openai-cookbook](https://github.com/openai/openai-cookbook)
- Tags: tutorial
- Published: 2026-03-02

---

**Fine-tuning GPT-4o on domain-specific image-question-answer examples can improve visual question answering accuracy by teaching the model the patterns of a specialized dataset such as OCR-VQA.**

GPT-4o's multimodal architecture supports end-to-end fine-tuning for visual question answering (VQA) tasks, allowing developers to specialize the model for specific visual domains. The OpenAI Cookbook provides a complete implementation in `examples/multimodal/Vision_Fine_tuning_on_GPT4o_for_Visual_Question_Answering.ipynb` that demonstrates the full pipeline from raw images to a deployed fine-tuned model. This guide walks through the technical implementation, including data preparation, message formatting, and evaluation strategies used in the official notebook.

## Dataset Selection and Image Preprocessing

The reference implementation uses the public **OCR-VQA dataset**, which contains approximately 200,000 book-cover images, each paired with questions and answers about the cover. To keep experimentation fast, the notebook samples a small subset: 150 training examples, 50 validation examples, and 100 test examples.

The notebook embeds each image in the training JSONL as a base64-encoded JPEG. The `encode_image` helper converts a PIL Image object to RGB color space if needed, compresses it to JPEG at a configurable quality setting, and returns a base64 string suitable for a data URL.

```python
from datasets import load_dataset
from PIL import Image
import base64
from io import BytesIO

# Load the OCR-VQA dataset (or any VQA dataset)
ds = load_dataset("howard-hou/OCR-VQA")

def encode_image(image, quality=100):
    if image.mode != "RGB":
        image = image.convert("RGB")
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

```
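
To make the data-URL step concrete, here is a minimal, standard-library-only sketch of how a base64 string like the one `encode_image` returns is wrapped in a `data:` URL and decoded back. The helper names `to_data_url` and `from_data_url` are illustrative and do not come from the notebook, and the byte string is a placeholder standing in for real JPEG output.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a base64 data URL (hypothetical helper)."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def from_data_url(url: str) -> bytes:
    """Recover the raw bytes from a base64 data URL."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload)


# Placeholder bytes standing in for real JPEG data (they begin with the JPEG SOI marker).
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16
url = to_data_url(fake_jpeg)
```

Decoding a data URL back to bytes is a quick way to confirm that an image survived serialization before it is written into a training file.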

## Constructing Multimodal Training Examples

Each training example follows the **Chat Completions message format** required by the fine-tuning API. The structure consists of a system message containing task instructions, optional few-shot examples to bias output style, a user message combining the question and base64-encoded image, and an assistant message containing the ground-truth answer.

The system prompt defines the model's behavior for visual question answering, specifying how to handle both open-ended and binary yes/no questions. Few-shot examples embedded in every training instance reinforce the desired response format and reduce variability during inference.

```python
SYSTEM_PROMPT = """
Generate an answer to the question based on the image of the book provided.
Questions will include both open-ended questions and binary "yes/no" questions.
...
"""

def build_example(question, answer, pil_image):
    system_msg = {"role": "system",
                  "content": [{"type": "text", "text": SYSTEM_PROMPT}]}
    user_msg = {"role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{encode_image(pil_image, quality=50)}"}}
                ]}
    assistant_msg = {"role": "assistant",
                     "content": [{"type": "text", "text": answer}]}
    # Prepend the few-shot messages (FEW_SHOT_EXAMPLES is defined in the notebook).
    return {"messages": [system_msg] + FEW_SHOT_EXAMPLES + [user_msg, assistant_msg]}
```
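
`build_example` relies on a `FEW_SHOT_EXAMPLES` list that this guide does not reproduce. The sketch below shows one plausible shape for it, assuming each few-shot item is a user/assistant exchange in the same content-part format. The questions, answers, `few_shot_pair` helper, and placeholder image URL are all invented for illustration and are not the notebook's actual examples.

```python
# Hypothetical few-shot block: every question and answer below is an invented
# placeholder, and the image URL must be replaced with a real base64 data URL.
PLACEHOLDER_IMAGE_URL = "data:image/jpeg;base64,<BASE64_JPEG>"


def few_shot_pair(question, answer, image_url):
    """Return one user/assistant exchange as a list of two chat messages."""
    return [
        {"role": "user",
         "content": [
             {"type": "text", "text": question},
             {"type": "image_url", "image_url": {"url": image_url}},
         ]},
        {"role": "assistant",
         "content": [{"type": "text", "text": answer}]},
    ]


# Mix one binary and one open-ended question so both answer styles are demonstrated.
FEW_SHOT_EXAMPLES = (
    few_shot_pair("Is this a cookbook?", "No", PLACEHOLDER_IMAGE_URL)
    + few_shot_pair("Who wrote this book?", "Jane Doe", PLACEHOLDER_IMAGE_URL)
)
```

Because the few-shot messages are repeated in every training example, keeping them short limits how much each example grows.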

## Serializing and Uploading Training Data

The training, validation, and test splits are serialized to separate `.jsonl` files with one JSON object per line. Each object contains a `messages` array following the schema above. The test file omits the assistant message because the model must generate the answer during evaluation.

```python
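import json
from tqdm import tqdm

jsonl_path = "ocr-vqa-train.jsonl"
with open(jsonl_path, "w") as f:
    # Iterating a Hugging Face Dataset yields one dict per row (it has no
    # iterrows()). This assumes each row has already been flattened to a
    # single "question", "answer", and "image".
    for row in tqdm(ds["train"]):
        example = build_example(row["question"], row["answer"], row["image"])
        f.write(json.dumps(example) + "\n")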
```
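
Before uploading, it can help to confirm that every line parses and has the expected shape. The following is a minimal sketch of such a check, not the validation the API itself performs. The function name `validate_jsonl_lines` and the specific checks are assumptions chosen for illustration.

```python
import json


def validate_jsonl_lines(lines, require_assistant=True):
    """Return (line_number, problem) tuples; an empty list means no issues were found."""
    problems = []
    for n, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        messages = obj.get("messages") if isinstance(obj, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((n, "missing or empty 'messages' list"))
            continue
        last = messages[-1]
        last_role = last.get("role") if isinstance(last, dict) else None
        # Training and validation examples should end with the ground-truth answer.
        if require_assistant and last_role != "assistant":
            problems.append((n, "last message is not from the assistant"))
    return problems


good = json.dumps({"messages": [{"role": "user", "content": "q"},
                                {"role": "assistant", "content": "a"}]})
bad = json.dumps({"messages": [{"role": "user", "content": "q"}]})
report = validate_jsonl_lines([good, bad, "{not json"])
```

An open file handle can be passed directly as `lines`, since iterating a text file yields one line at a time. For the test split, which omits the assistant message, pass `require_assistant=False`.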

Upload the JSONL files via the Files API with `purpose="fine-tune"` before creating the training job. Validation files follow the same schema as training files and enable monitoring of validation metrics during training.

```python
from openai import OpenAI
client = OpenAI()

# Context managers close the file handles once each upload finishes.
with open("ocr-vqa-train.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")
with open("ocr-vqa-validation.jsonl", "rb") as f:
    val_file = client.files.create(file=f, purpose="fine-tune")
```

## Launching the Fine-tuning Job

Create the fine-tuning job using the stable GPT-4o snapshot `gpt-4o-2024-08-06` as the base model. The job runs asynchronously and progress can be monitored through the OpenAI Platform UI or via API polling.

```python
ft_job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=val_file.id,
    model="gpt-4o-2024-08-06"
)
print(f"Fine-tuning job ID: {ft_job.id}")

```
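
For API polling, one possible approach is a small loop that checks the job status until it reaches a terminal state. The sketch below takes the retrieval function as a parameter so it can be exercised without network access. The helper name `wait_for_job`, its defaults, and the set of terminal statuses are assumptions for illustration rather than code from the notebook.

```python
import time

# Statuses treated as final in this sketch (an assumption; adjust if needed).
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status.

    `retrieve` is any callable returning an object with a `.status` attribute,
    for example `client.fine_tuning.jobs.retrieve`.
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

With the client from the earlier snippet, this would be called as `wait_for_job(client.fine_tuning.jobs.retrieve, ft_job.id)`. The returned job should be checked for a `succeeded` status before its model is used.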

Upon completion, the job's `fine_tuned_model` field holds the fine-tuned model identifier, which looks like `ft:gpt-4o-2024-08-06:openai::YOUR_SUFFIX` (the organization and trailing segments will differ for your account). This identifier replaces the base model name in subsequent inference calls.

## Inference and Evaluation Strategies

Query the fine-tuned model using the same message structure as training, excluding the assistant message. The model generates answers based on learned visual patterns from the specialized domain.

```python
ft_model = "ft:gpt-4o-2024-08-06:openai::YOUR_SUFFIX"
response = client.chat.completions.create(
    model=ft_model,
    messages=test_example["messages"]
)
print("Predicted answer:", response.choices[0].message.content.strip())

```
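
The snippet above assumes a `test_example` whose messages already exclude the assistant turn. If only training-style examples are available, a helper along the following lines can strip the final answer and keep it as the reference for scoring. The name `to_inference_messages` is hypothetical, and the sketch assumes the assistant turn is the last message.

```python
import copy


def to_inference_messages(example):
    """Split a training-style example into (prompt_messages, reference_answer).

    The input is left unmodified. reference_answer is None when the example
    does not end with an assistant message.
    """
    messages = copy.deepcopy(example["messages"])
    reference = None
    if messages and messages[-1]["role"] == "assistant":
        content = messages.pop()["content"]
        if isinstance(content, str):
            reference = content
        else:
            # Content-part form: join the text parts of the assistant message.
            reference = "".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
    return messages, reference


sample = {"messages": [
    {"role": "system", "content": [{"type": "text", "text": "Answer the question."}]},
    {"role": "user", "content": [{"type": "text", "text": "Is this a novel?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Yes"}]},
]}
prompt_messages, reference_answer = to_inference_messages(sample)
```

The returned `prompt_messages` can be passed as `messages` in the inference call, and `reference_answer` kept alongside the prediction for evaluation.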

Evaluation uses a two-tier approach. First, score closed-form (yes/no) questions with a case-insensitive string comparison that also accepts a prediction containing the reference answer. Second, use GPT-4o itself as a judge to grade open-ended answers on a four-level similarity scale: Very Similar, Mostly Similar, Somewhat Similar, or Incorrect. This automated grading captures semantic correctness beyond string matching.

```python
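def is_correct(pred, truth):
    # Case-insensitive match; substring containment also covers exact equality.
    pred, truth = pred.strip().lower(), truth.strip().lower()
    return truth in pred

# results_if_closed_form: list of dicts holding predicted and actual answers
correct = sum(is_correct(r["predicted_answer"], r["actual_answer"])
              for r in results_if_closed_form)
accuracy = 100 * correct / len(results_if_closed_form)
print(f"Closed-form accuracy: {accuracy:.2f}%")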
```
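
The judge step for open-ended answers could be sketched as a prompt builder plus a parser that maps the judge's reply onto the four labels. The instruction wording, `build_judge_messages`, and `parse_rating` below are assumptions for illustration and do not reproduce the notebook's actual judge prompt.

```python
RATINGS = ["Very Similar", "Mostly Similar", "Somewhat Similar", "Incorrect"]

# Hypothetical judge instructions; the wording is illustrative only.
JUDGE_INSTRUCTIONS = (
    "You compare a predicted answer with a reference answer to a question "
    "about a book cover. Reply with exactly one of: " + ", ".join(RATINGS) + "."
)


def build_judge_messages(question, predicted, actual):
    """Build chat messages asking a judge model to rate answer similarity."""
    user_text = (
        f"Question: {question}\n"
        f"Reference answer: {actual}\n"
        f"Predicted answer: {predicted}"
    )
    return [
        {"role": "system", "content": JUDGE_INSTRUCTIONS},
        {"role": "user", "content": user_text},
    ]


def parse_rating(reply):
    """Map a free-text judge reply to one of RATINGS, or None if no label is found."""
    lowered = reply.lower()
    for rating in RATINGS:
        if rating.lower() in lowered:
            return rating
    return None


judge_messages = build_judge_messages(
    "Who wrote this book?", "J. Doe", "Jane Doe"
)
```

The messages can be sent with `client.chat.completions.create(model="gpt-4o", messages=judge_messages)` and the reply text passed to `parse_rating`. Replies that parse to `None` are worth counting separately rather than treating them as incorrect.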

## Summary

- **Multimodal fine-tuning** enables GPT-4o to learn domain-specific visual patterns, improving accuracy on specialized VQA tasks beyond the base model's general capabilities.
- The **OCR-VQA dataset** provides a reference implementation using book covers, but any image-question-answer dataset works with the same pipeline.
- **Base64-JPEG encoding** converts PIL images to data URLs compatible with the Chat Completions API format required for fine-tuning.
- **Few-shot examples** embedded in training messages reduce output variability and enforce consistent response formatting.
- The **fine-tuning API** accepts JSONL files via the Files API with `purpose="fine-tune"` and uses `gpt-4o-2024-08-06` as the stable base model identifier.
- **Automated evaluation** using GPT-4o as a judge provides nuanced accuracy metrics for open-ended questions where exact matching fails.

## Frequently Asked Questions

### What image formats does GPT-4o fine-tuning support?

The fine-tuning pipeline accepts standard image formats through the **base64-encoded data URL** method. The reference implementation specifically uses JPEG encoding with configurable quality settings via the `encode_image` function, converting PIL Image objects to RGB color space before base64 serialization.

### How many training examples are needed for effective VQA fine-tuning?

The official notebook demonstrates meaningful results with as few as **150 training examples** and 50 validation examples when working with the OCR-VQA dataset. However, performance scales with data quality and domain specificity; specialized medical imaging or technical diagrams may require larger datasets to capture relevant visual patterns.

### Can I fine-tune GPT-4o on custom image domains beyond book covers?

Yes. While the cookbook example uses book-cover images from OCR-VQA, the same pipeline applies to other visual domains. Replace the dataset loader, adjust the **system prompt** to describe your specific visual task, and optionally curate domain-specific **few-shot examples** to guide the model's response style for your use case.

### How do I evaluate open-ended questions where answers may vary?

For questions without fixed answers, exact string matching proves insufficient. The reference implementation uses **GPT-4o as an evaluator** to grade answer similarity on a four-point scale from Very Similar to Incorrect. This automated judging approach captures semantic equivalence and partial credit better than binary correctness metrics.