
How to Implement Out-of-Band Transcription with OpenAI Realtime Sessions

March 2, 2026 · openai/openai-cookbook

Out-of-band transcription lets you generate verbatim transcripts from Realtime session audio without mutating the active conversation state. You do this by setting "conversation": "none" in a secondary response.create request.

The OpenAI Cookbook demonstrates how to use the Realtime API for both conversation and transcription in a single session. With out-of-band transcription, you extract text from user audio using the same gpt-4o-realtime model that handles your dialogue, which avoids discrepancies between the dialogue model and a separate transcription service and leaves the chat history untouched.

Understanding Out-of-Band Transcription

By default, the Realtime API operates in conversation: "auto" mode, where every model response is appended to the active chat history. Out-of-band transcription breaks this pattern by issuing a second response.create call on the same WebSocket with "conversation": "none", instructing the server to process the audio and return text without writing the response back into the conversation buffer.
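As a rough illustration, the two request shapes differ only in the conversation field. The sketch below builds both client events as plain dictionaries without opening a connection. The helper name make_response_event is hypothetical, and the field layout is assumed to follow the Realtime response.create client event.

```python
import json


def make_response_event(conversation="auto", instructions=None):
    """Build a response.create client event as a plain dict (hypothetical helper)."""
    response = {"conversation": conversation}
    if instructions is not None:
        # Per-response instructions apply to this response only.
        response["instructions"] = instructions
    return {"type": "response.create", "response": response}


# Default behaviour: the response is appended to the conversation.
default_event = make_response_event()

# Out-of-band: the response is generated but not written to the conversation.
oob_event = make_response_event(
    "none", "Return the verbatim transcript of the user audio."
)

print(json.dumps(oob_event, indent=2))
```

Over a raw WebSocket, each dictionary would be serialized to JSON and sent as one message.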

This technique provides three distinct advantages:

Single-session consistency: The transcription is generated by the exact same model instance handling the conversation, eliminating semantic drift between separate endpoints like Whisper or gpt-4o-transcribe.
Context awareness: You can optionally inject prior conversation turns into the transcription request, allowing the model to resolve ambiguities using established terminology or speaker names.
Simplified client logic: All processing occurs over the existing WebSocket connection—no additional API endpoints, authentication flows, or client-side audio buffering required.

According to the source code in examples/Realtime_out_of_band_transcription.ipynb, this pattern is particularly valuable when you need accurate transcripts for logging or downstream processing without polluting the model's context window with transcription artifacts.

Architecture and Implementation Flow

The implementation follows a specific sequence to isolate the transcription from the conversational stream.

1. Establish the Realtime Session

First, create a persistent WebSocket connection that will handle both the conversational flow and the out-of-band requests.

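from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Assumes a recent openai-python release; older ones expose client.beta.realtime.
# The model id is an assumption; use the Realtime model your account supports.
async with client.realtime.connect(model="gpt-realtime") as connection:
    ...  # steps 2-4 run inside this block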

2. Stream User Audio

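Send audio to the model by appending base64-encoded chunks to the input audio buffer with input_audio_buffer.append. The model processes this audio and prepares its conversational response, updating the session state normally.

import base64

# Assumption: the file holds raw 16-bit PCM audio in the session's input format.
with open("user_speech.pcm", "rb") as f:
    audio_bytes = f.read()

# Audio travels as a base64 string inside the append event.
await connection.input_audio_buffer.append(
    audio=base64.b64encode(audio_bytes).decode("ascii")
)
# A manual commit is only needed when server-side turn detection is disabled.
await connection.input_audio_buffer.commit()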
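For longer recordings you would normally send several smaller append events rather than one large one. The following is a minimal sketch of that chunking step, assuming raw 16-bit PCM input; the function name audio_append_events and the default chunk size are illustrative choices, not values taken from the cookbook.

```python
import base64


def audio_append_events(pcm_bytes, chunk_size=32_000):
    """Split raw PCM bytes into input_audio_buffer.append events (hypothetical helper)."""
    if chunk_size <= 0 or chunk_size % 2:
        # Keep chunks aligned to 2-byte samples so no sample is split across events.
        raise ValueError("chunk_size must be a positive even number")
    events = []
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        events.append(
            {
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }
        )
    return events


# One second of 16-bit mono silence at 24 kHz is 48,000 bytes.
silence = bytes(48_000)
events = audio_append_events(silence)
print(len(events), "events")
```

Each event can then be sent in order over the connection, followed by a commit if turn detection is disabled.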

3. Trigger the Out-of-Band Request

Initiate a second response.create call with the critical "conversation": "none" flag. As implemented in the cookbook at line 598 of Realtime_out_of_band_transcription.ipynb, this flag prevents the server from mutating the active conversation state.

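transcription_payload = {
    "conversation": "none",  # Out-of-band flag: do not write the result to the conversation
    # Text-only output; older beta sessions are assumed to call this field "modalities".
    "output_modalities": ["text"],
    # Hypothetical tag so the client can tell this response apart from dialogue responses.
    "metadata": {"purpose": "transcription"},
    "instructions": (
        "You are a transcription assistant. Return the verbatim transcript "
        "of the supplied user audio."
    ),
}

# The SDK wraps the payload in a response.create event.
await connection.response.create(response=transcription_payload)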

The instructions field functions as a system prompt that overrides the conversational persona, directing the model to produce pure transcription rather than dialogue.

4. Capture the Transcript

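Listen for response.output_text.delta events to retrieve the transcription text as it is generated. Because the request was out-of-band, the conversation buffer remains unchanged, but you receive the transcript through the same event stream.

async for event in connection:
    # Assumption: current event names; beta sessions use response.text.delta / response.text.done.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)  # partial text fragment
    elif event.type == "response.output_text.done":
        transcript = event.text  # full text of the finished output
        print("\nTranscript:", transcript)
        break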
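Because conversational and out-of-band responses share one event stream, the client needs a way to tell them apart. One option, sketched below under assumptions, is to tag the out-of-band request with a metadata value and keep only the text deltas whose response id belongs to a tagged response. The function collect_transcript, the "purpose" key, and the simulated events are all hypothetical; the event field names are assumed to match the Realtime server events.

```python
def collect_transcript(events, purpose="transcription"):
    """Join text deltas from the response tagged with metadata['purpose'] == purpose."""
    tagged_ids = set()
    parts = []
    for event in events:
        kind = event["type"]
        if kind == "response.created":
            metadata = event["response"].get("metadata") or {}
            if metadata.get("purpose") == purpose:
                tagged_ids.add(event["response"]["id"])
        elif kind == "response.output_text.delta" and event["response_id"] in tagged_ids:
            parts.append(event["delta"])
        elif kind == "response.done" and event["response"]["id"] in tagged_ids:
            break  # the tagged response has finished
    return "".join(parts)


# Simulated stream: a dialogue response followed by the out-of-band transcription.
simulated = [
    {"type": "response.created", "response": {"id": "resp_chat", "metadata": None}},
    {"type": "response.output_text.delta", "response_id": "resp_chat", "delta": "Sure, I can help."},
    {"type": "response.done", "response": {"id": "resp_chat"}},
    {"type": "response.created", "response": {"id": "resp_oob", "metadata": {"purpose": "transcription"}}},
    {"type": "response.output_text.delta", "response_id": "resp_oob", "delta": "book a table "},
    {"type": "response.output_text.delta", "response_id": "resp_oob", "delta": "for two"},
    {"type": "response.done", "response": {"id": "resp_oob"}},
]

transcript = collect_transcript(simulated)
print(transcript)
```

The same filtering logic could run inside the live event loop, with attribute access on SDK event objects in place of dictionary lookups.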

Optional Context Injection

While out-of-band transcription keeps the response out of the history, you can still use conversational context to improve accuracy. The input field of the response.create request accepts a list of conversation items, which can be a subset of prior turns, so the model can resolve ambiguous pronunciations or domain-specific terminology from that context.

As noted in the cookbook at line 1428, including context increases token costs because the model re-processes the supplied history. Selective inclusion—such as providing only the last two turns versus the full session—allows you to balance transcription accuracy against compute expenses.
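Selective inclusion can be sketched as a small helper that references only the most recent conversation items. This is an illustration under assumptions: build_transcription_response is a hypothetical name, the item ids are made up, and the item_reference shape is assumed to match the Realtime API's input items. Since a supplied input list is assumed to replace the default context for that response, the list should end with the user audio item you want transcribed.

```python
def build_transcription_response(instructions, item_ids, max_items=2):
    """Build an out-of-band response body that references the last max_items items."""
    if max_items <= 0:
        raise ValueError("max_items must be at least 1 so the audio item is included")
    recent = item_ids[-max_items:]
    return {
        "conversation": "none",
        "instructions": instructions,
        # References to existing conversation items, oldest first.
        "input": [{"type": "item_reference", "id": item_id} for item_id in recent],
    }


# Hypothetical ids; the last one is assumed to be the user audio turn to transcribe.
history = ["item_001", "item_002", "item_003", "item_004"]
response_body = build_transcription_response(
    "Return the verbatim transcript of the latest user audio.", history
)
print([item["id"] for item in response_body["input"]])
```

Raising max_items trades higher input token usage for more context.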

Cost Considerations

Out-of-band transcription incurs higher compute costs than dedicated transcription endpoints like Whisper. Because the Realtime model recomputes the entire supplied context (including any optional message history you include) to generate the transcript, you pay for the full contextual processing rather than a streamlined audio-to-text conversion.

The notebook at examples/Realtime_out_of_band_transcription.ipynb provides a detailed cost breakdown at line 1428, demonstrating how context window selection directly impacts pricing.
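To see how context size drives cost, a back-of-the-envelope estimate multiplies each token count by its per-million-token rate. The sketch below uses placeholder rates and token counts invented purely for illustration; they are not OpenAI prices and not figures from the notebook, so substitute the values that apply to your model.

```python
def estimate_cost_usd(token_counts, rates_per_million):
    """Sum token_count * rate / 1,000,000 over each token category."""
    return sum(
        count * rates_per_million[category] / 1_000_000
        for category, count in token_counts.items()
    )


# Placeholder rates in USD per million tokens; NOT real prices.
PLACEHOLDER_RATES = {"audio_in": 30.0, "text_in": 4.0, "text_out": 16.0}

# Hypothetical token counts for the same audio turn with different context sizes.
two_turns = {"audio_in": 1_500, "text_in": 800, "text_out": 200}
full_session = {"audio_in": 1_500, "text_in": 12_000, "text_out": 200}

cost_small = estimate_cost_usd(two_turns, PLACEHOLDER_RATES)
cost_full = estimate_cost_usd(full_session, PLACEHOLDER_RATES)
print(f"two turns: ${cost_small:.4f}, full session: ${cost_full:.4f}")
```

In this made-up example only the text input count changes, which isolates the effect of including more history.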

Key Files in the Repository

The OpenAI Cookbook contains multiple reference implementations of this pattern:

| File | Purpose |
| --- | --- |
| `examples/Realtime_out_of_band_transcription.ipynb` | Complete notebook walkthrough demonstrating the technique, system prompts, and cost analysis. |
| `registry.yaml` | Registers the example under the entry `realtime-out-of-band-transcription` for the cookbook UI. |
| `examples/evals/realtime_evals/shared/realtime_harness_utils.py` | Contains low-level helper functions that perform `await connection.response.create(...)` calls, with line 156 showing the implementation pattern and line 332 demonstrating bare `response.create()` usage. |

Summary

Use "conversation": "none" in response.create to generate transcripts without polluting chat history.
Reuse the same Realtime model (gpt-4o-realtime) and session for both dialogue and transcription to ensure consistency.
Supply a transcription-focused system prompt via the instructions field to override conversational behavior.
Monitor context costs carefully; out-of-band transcription is more expensive than dedicated endpoints due to full context recomputation.
Reference the cookbook notebook at examples/Realtime_out_of_band_transcription.ipynb for a complete walkthrough.

Frequently Asked Questions

What is the difference between default Realtime transcription and out-of-band transcription?

By default, every response the model generates is appended to the conversation history, so a transcript requested that way would become part of the dialogue. Out-of-band transcription uses "conversation": "none" to process audio and return text without mutating the session state, keeping the chat history clean while still using the same underlying model.

How do I prevent the transcription from appearing in the chat history?

Set the request-level flag "conversation": "none" in your response.create payload. According to the source code in examples/Realtime_out_of_band_transcription.ipynb at line 598, this flag explicitly instructs the server not to write the response back into the active conversation buffer.

Can I use context from previous turns in out-of-band transcription?

Yes. Include an input array in your transcription request containing the relevant prior turns. This allows the model to resolve ambiguities using established context, though as documented at line 1428 of the cookbook notebook, this increases costs because the model re-processes the supplied history.

Why is out-of-band transcription more expensive than using Whisper?

The Realtime model recomputes the entire supplied context—including any conversation history you provide—to generate the transcript, whereas Whisper performs a streamlined audio-to-text conversion without context awareness. This full contextual processing consumes more tokens, making out-of-band transcription costlier per minute of audio when context is included.
