How to Fine-tune GPT-4o for Visual Question Answering: A Complete Guide
Fine-tuning GPT-4o on domain-specific image-question pairs improves visual reasoning accuracy by training the model to recognize patterns in specialized datasets like OCR-VQA.
GPT-4o's multimodal architecture supports end-to-end fine-tuning for visual question answering (VQA) tasks, allowing developers to specialize the model for specific visual domains. The OpenAI Cookbook provides a complete implementation in examples/multimodal/Vision_Fine_tuning_on_GPT4o_for_Visual_Question_Answering.ipynb that demonstrates the full pipeline from raw images to a deployed fine-tuned model. This guide walks through the technical implementation, including data preparation, message formatting, and evaluation strategies used in the official notebook.
Dataset Selection and Image Preprocessing
The reference implementation uses the public OCR-VQA dataset, which contains approximately 200,000 book-cover images, each paired with questions and answers about the cover. For efficient experimentation, the notebook samples a small subset: 150 training examples, 50 validation examples, and 100 test examples.
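The subset sizes above could be drawn reproducibly along the following lines. This is a minimal sketch, not the notebook's own sampling code: `sample_indices`, the seed, and the row counts are hypothetical, and the function only returns row indices that you would then use to select rows from each split.

```python
import random

# Subset sizes quoted in the text above
SPLIT_SIZES = {"train": 150, "validation": 50, "test": 100}

def sample_indices(n_rows, k, seed=42):
    """Pick k distinct row indices out of n_rows, reproducibly."""
    if k > n_rows:
        raise ValueError(f"cannot sample {k} rows from {n_rows}")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return sorted(rng.sample(range(n_rows), k))

# Hypothetical row counts; in practice use len() of each loaded split
row_counts = {"train": 5000, "validation": 800, "test": 900}

subset = {name: sample_indices(row_counts[name], k)
          for name, k in SPLIT_SIZES.items()}
print({name: len(idx) for name, idx in subset.items()})
```

With a Hugging Face `Dataset`, the resulting index list can be passed to `Dataset.select` to materialize the subset.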
Images require conversion to base64-encoded JPEG format before inclusion in the training JSONL. The encode_image helper function handles this transformation by converting PIL Image objects to RGB color space, compressing them to JPEG with configurable quality settings, and returning base64 strings suitable for data URLs.
from datasets import load_dataset
from PIL import Image
import base64
from io import BytesIO

# Load the OCR-VQA dataset (or any VQA dataset)
ds = load_dataset("howard-hou/OCR-VQA")

def encode_image(image, quality=100):
    """Return a PIL image as a base64-encoded JPEG string."""
    # JPEG has no alpha channel, so convert anything that is not RGB first
    if image.mode != "RGB":
        image = image.convert("RGB")
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
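The base64 string is later wrapped in a data URL, so it can help to see that wrapping in isolation. The sketch below uses only the standard library; `to_data_url` and `from_data_url` are hypothetical helper names, and the bytes are a placeholder rather than a real JPEG.

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    """Wrap raw image bytes in a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

def from_data_url(url):
    """Split a base64 data URL back into (mime type, raw bytes)."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)

# Placeholder bytes standing in for real JPEG data
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16

url = to_data_url(fake_jpeg)
mime, decoded = from_data_url(url)
print(mime, decoded == fake_jpeg, len(fake_jpeg), len(url))
```

Base64 encodes every 3 bytes as 4 characters, so the payload grows by roughly a third. That growth is one reason to lower the JPEG quality before encoding.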
Constructing Multimodal Training Examples
Each training example follows the Chat Completions message format required by the fine-tuning API. The structure consists of a system message containing task instructions, optional few-shot examples to bias output style, a user message combining the question and base64-encoded image, and an assistant message containing the ground-truth answer.
The system prompt defines the model's behavior for visual question answering, specifying how to handle both open-ended and binary yes/no questions. Few-shot examples embedded in every training instance reinforce the desired response format and reduce variability during inference.
SYSTEM_PROMPT = """
Generate an answer to the question based on the image of the book provided.
Questions will include both open-ended questions and binary "yes/no" questions.
...
"""

def build_example(question, answer, pil_image):
    system_msg = {"role": "system",
                  "content": [{"type": "text", "text": SYSTEM_PROMPT}]}
    user_msg = {"role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{encode_image(pil_image, quality=50)}"}}
                ]}
    assistant_msg = {"role": "assistant",
                     "content": [{"type": "text", "text": answer}]}
    # Prepend a few-shot set (FEW_SHOT_EXAMPLES is defined in the notebook)
    return {"messages": [system_msg] + FEW_SHOT_EXAMPLES + [user_msg, assistant_msg]}
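The few-shot set is a flat list of alternating user and assistant messages in the same schema as the real question. The sketch below shows one way such a list might be assembled. It is an illustration only: `few_shot_turn`, `EXAMPLE_FEW_SHOT`, the questions, the answers, and the `<BASE64_IMAGE_n>` placeholders are all hypothetical and do not reproduce the notebook's actual examples.

```python
def few_shot_turn(question, answer, image_b64):
    """Build one user/assistant message pair for a few-shot demonstration."""
    user = {"role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ]}
    assistant = {"role": "assistant",
                 "content": [{"type": "text", "text": answer}]}
    return [user, assistant]

# Placeholder strings stand in for real base64-encoded images
EXAMPLE_FEW_SHOT = (
    few_shot_turn("Who wrote this book?", "Jane Doe", "<BASE64_IMAGE_1>")
    + few_shot_turn("Is this a cookbook?", "No", "<BASE64_IMAGE_2>")
)

print([m["role"] for m in EXAMPLE_FEW_SHOT])
```

Because the few-shot images are repeated in every training example, each one adds to the size of every line in the JSONL file, so a small set is usually preferable.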
Serializing and Uploading Training Data
Training, validation, and test splits serialize to separate .jsonl files with one JSON object per line. Each object contains a messages array following the schema above. The test file omits the assistant message since the model must generate answers during evaluation.
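import json
import tqdm

jsonl_path = "ocr-vqa-train.jsonl"
with open(jsonl_path, "w") as f:
    # Iterating a Hugging Face Dataset yields one dict per row (it has no iterrows()).
    # Assumes each row exposes "question", "answer", and "image" fields.
    for row in tqdm.tqdm(ds["train"], total=len(ds["train"])):
        example = build_example(row["question"], row["answer"], row["image"])
        json.dump(example, f)
        f.write("\n")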
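Two small checks can catch formatting mistakes before upload: confirming that every training line ends with an assistant message, and deriving a test record by dropping that message. The helpers below are a hedged sketch with hypothetical names (`strip_final_assistant`, `check_training_line`); they cover only the structure described in this guide, not the full validation the API performs.

```python
import json

def strip_final_assistant(example):
    """Return a copy without the trailing assistant message (for test files)."""
    messages = example["messages"]
    if messages and messages[-1]["role"] == "assistant":
        return {"messages": messages[:-1]}
    return {"messages": list(messages)}

def check_training_line(line):
    """Parse one JSONL line and raise ValueError if it is not a usable training example."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("missing or empty 'messages' list")
    if messages[-1].get("role") != "assistant":
        raise ValueError("training examples must end with an assistant message")
    return record

# Text-only stand-in for a real multimodal example
train_example = {"messages": [
    {"role": "system", "content": "Answer questions about the book cover."},
    {"role": "user", "content": "Who is the author?"},
    {"role": "assistant", "content": "Jane Doe"},
]}

check_training_line(json.dumps(train_example))
test_example = strip_final_assistant(train_example)
print([m["role"] for m in test_example["messages"]])
```

Keep the removed ground-truth answers somewhere alongside the test records, since evaluation needs them to score predictions.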
Upload the JSONL files via the Files API with purpose="fine-tune" before creating the training job. Validation files follow the same schema as training files and enable monitoring of validation metrics during training.
from openai import OpenAI
client = OpenAI()
train_file = client.files.create(file=open("ocr-vqa-train.jsonl", "rb"),
                                 purpose="fine-tune")
val_file = client.files.create(file=open("ocr-vqa-validation.jsonl", "rb"),
                               purpose="fine-tune")
Launching the Fine-tuning Job
Create the fine-tuning job using the stable GPT-4o snapshot gpt-4o-2024-08-06 as the base model. The job runs asynchronously and progress can be monitored through the OpenAI Platform UI or via API polling.
ft_job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=val_file.id,
    model="gpt-4o-2024-08-06"
)
print(f"Fine-tuning job ID: {ft_job.id}")
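API polling can be as simple as retrieving the job on an interval until it reaches a terminal status. The sketch below is one possible shape, not the notebook's code: `wait_for_job` is a hypothetical helper that takes the retrieval function as an argument, and a stand-in replaces the real API so the example runs offline.

```python
import time
from types import SimpleNamespace

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, sleep=time.sleep, max_polls=None):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    polls = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job {job_id} still {job.status} after {polls} polls")
        sleep(poll_seconds)

# Offline stand-in for the API: two non-terminal statuses, then success
_statuses = iter(["validating_files", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_statuses),
                           fine_tuned_model="ft:placeholder-model")

job = wait_for_job(fake_retrieve, "ftjob-placeholder", sleep=lambda s: None)
print(job.status, job.fine_tuned_model)
```

With the real client, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, ft_job.id)`, and the returned job's `fine_tuned_model` attribute holds the identifier to use at inference time.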
Upon completion, the API provides a fine-tuned model identifier following the format ft:gpt-4o-2024-08-06:openai::YOUR_SUFFIX; the segments after the base model name are specific to your organization and job. This identifier replaces the base model name in subsequent inference calls.
Inference and Evaluation Strategies
Query the fine-tuned model using the same message structure as training, excluding the assistant message. The model generates answers based on learned visual patterns from the specialized domain.
ft_model = "ft:gpt-4o-2024-08-06:openai::YOUR_SUFFIX"
response = client.chat.completions.create(
    model=ft_model,
    messages=test_example["messages"]
)
print("Predicted answer:", response.choices[0].message.content.strip())
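To evaluate a whole test set, the single call above can be wrapped in a loop that pairs each prediction with its ground-truth answer. The following is a sketch under assumptions: `predict_all` is a hypothetical helper, the test data is assumed to be a list of `(example, actual_answer)` tuples, and a stand-in replaces the real API so the code runs offline.

```python
from types import SimpleNamespace

def predict_all(create, model, labelled_examples):
    """Run each test example through `create` and pair the output with its label.

    `labelled_examples` is a list of (example, actual_answer) tuples, where
    example["messages"] contains no final assistant message.
    """
    results = []
    for example, actual_answer in labelled_examples:
        response = create(model=model, messages=example["messages"])
        text = response.choices[0].message.content or ""
        results.append({"predicted_answer": text.strip(),
                        "actual_answer": actual_answer})
    return results

# Offline stand-in for the chat completions call; it always answers "Yes"
def fake_create(model, messages):
    message = SimpleNamespace(content=" Yes \n")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

demo = [({"messages": [{"role": "user", "content": "Is this a novel?"}]}, "Yes")]
results = predict_all(fake_create, "ft:placeholder-model", demo)
print(results)
```

With the real client, pass `client.chat.completions.create` as `create` and the fine-tuned model identifier as `model`.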
Evaluation uses a two-tier approach. First, calculate exact-match accuracy for closed-form (yes/no) questions using string comparison. Second, use GPT-4o itself as a judge to grade open-ended answer similarity on a scale of Very Similar, Mostly Similar, Somewhat Similar, or Incorrect. This automated grading captures semantic correctness beyond exact string matching.
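def is_correct(pred, truth):
    """Case-insensitive check that the truth equals or appears inside the prediction."""
    pred, truth = pred.strip().lower(), truth.strip().lower()
    # Substring matching tolerates answers like "Yes, it is." but can also
    # over-count: for example, "no" appears inside "unknown".
    return pred == truth or truth in pred

# results_if_closed_form: result dicts for the yes/no questions only
correct = sum(is_correct(r["predicted_answer"], r["actual_answer"])
              for r in results_if_closed_form)
accuracy = 100 * correct / len(results_if_closed_form)
print(f"Closed-form accuracy: {accuracy:.2f}%")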
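For the open-ended tier, the judge needs a prompt that presents both answers and a parser that maps its reply onto the four-level scale. The sketch below illustrates that flow with the standard library only. The prompt wording and the helper names (`build_judge_prompt`, `parse_rating`, `tally`) are assumptions, not the notebook's actual grading prompt, and the judge replies are hard-coded rather than fetched from a model.

```python
from collections import Counter

# The four grades named in the text above
RATINGS = ("Very Similar", "Mostly Similar", "Somewhat Similar", "Incorrect")

def build_judge_prompt(question, predicted, actual):
    """Assemble a grading prompt for the judge model (hypothetical wording)."""
    options = ", ".join(RATINGS)
    return (
        "Compare the predicted answer with the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {actual}\n"
        f"Predicted answer: {predicted}\n"
        f"Reply with exactly one of: {options}."
    )

def parse_rating(judge_text):
    """Return the first rating label found in the judge's reply, or None."""
    lowered = judge_text.lower()
    for rating in RATINGS:
        if rating.lower() in lowered:
            return rating
    return None

def tally(judge_replies):
    """Count replies per rating; replies with no recognizable label count as 'Unparsed'."""
    return Counter(parse_rating(reply) or "Unparsed" for reply in judge_replies)

# Hard-coded replies standing in for real judge output
replies = ["Very Similar", "Rating: mostly similar.", "Incorrect", "no idea"]
print(build_judge_prompt("Who wrote this book?", "J. Doe", "Jane Doe"))
print(tally(replies))
```

Tracking unparsed replies separately makes it visible when the judge drifts from the requested format, rather than silently counting those cases as incorrect.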
Summary
- Multimodal fine-tuning enables GPT-4o to learn domain-specific visual patterns, improving accuracy on specialized VQA tasks beyond the base model's general capabilities.
- The OCR-VQA dataset provides a reference implementation using book covers, but any image-question-answer dataset works with the same pipeline.
- Base64-JPEG encoding converts PIL images to data URLs compatible with the Chat Completions API format required for fine-tuning.
- Few-shot examples embedded in training messages reduce output variability and enforce consistent response formatting.
- The fine-tuning API accepts JSONL files via the Files API with purpose="fine-tune" and uses gpt-4o-2024-08-06 as the stable base model identifier.
- Automated evaluation using GPT-4o as a judge provides nuanced accuracy metrics for open-ended questions where exact matching fails.
Frequently Asked Questions
What image formats does GPT-4o fine-tuning support?
The fine-tuning pipeline accepts standard image formats through the base64-encoded data URL method. The reference implementation specifically uses JPEG encoding with configurable quality settings via the encode_image function, converting PIL Image objects to RGB color space before base64 serialization.
How many training examples are needed for effective VQA fine-tuning?
The official notebook demonstrates meaningful results with as few as 150 training examples and 50 validation examples when working with the OCR-VQA dataset. However, performance scales with data quality and domain specificity; specialized medical imaging or technical diagrams may require larger datasets to capture relevant visual patterns.
Can I fine-tune GPT-4o on custom image domains beyond book covers?
Yes. While the cookbook example uses book-cover images from OCR-VQA, the architecture supports any visual domain. Replace the dataset loader, adjust the system prompt to describe your specific visual task, and optionally curate domain-specific few-shot examples to guide the model's reasoning style for your particular use case.
How do I evaluate open-ended questions where answers may vary?
For questions without fixed answers, exact string matching proves insufficient. The reference implementation uses GPT-4o as an evaluator to grade answer similarity on a four-point scale from Very Similar to Incorrect. This automated judging approach captures semantic equivalence and partial credit better than binary correctness metrics.