How to Implement Out-of-Band Transcription with OpenAI Realtime Sessions
Out-of-band transcription lets you generate verbatim transcripts from Realtime session audio without mutating the active conversation state by setting "conversation": "none" in a secondary response.create request.
The OpenAI Cookbook demonstrates how to leverage the Realtime API for dual-purpose conversational and transcription workflows. By implementing out-of-band transcription with Realtime sessions, you can extract high-fidelity text from user audio using the exact same gpt-4o-realtime model instance that handles your dialogue, eliminating drift between separate transcription services while keeping the chat history pristine.
Understanding Out-of-Band Transcription
By default, the Realtime API operates in conversation: "auto" mode, where every model response is appended to the active chat history. Out-of-band transcription breaks this pattern by issuing a second response.create call on the same WebSocket with "conversation": "none", instructing the server to process the audio and return text without writing the response back into the conversation buffer.
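The difference between the two modes can be illustrated with a toy model of conversation state. This is only a sketch: `ToySession` is a hypothetical stand-in written for this article, not part of the OpenAI SDK, and it models nothing beyond "append to history" versus "return without appending".

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Toy stand-in for server-side conversation state (not the real API)."""

    history: list = field(default_factory=list)

    def add_user_audio(self, label: str) -> None:
        # User input always becomes part of the default conversation.
        self.history.append(f"user:{label}")

    def respond(self, output: str, conversation: str = "auto") -> str:
        # "auto" appends the model output to the history; "none" hands the
        # output back to the caller and leaves the history untouched.
        if conversation == "auto":
            self.history.append(f"assistant:{output}")
        elif conversation != "none":
            raise ValueError(f"unsupported conversation mode: {conversation!r}")
        return output


session = ToySession()
session.add_user_audio("turn-1")
reply = session.respond("Sure, I can help with that.")
transcript = session.respond("can you help me with my order", conversation="none")
print(session.history)  # the transcript is not in the history
```

The out-of-band call still produces output for the client, but a later turn would see only the user item and the conversational reply.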
This technique provides three distinct advantages:
- Single-session consistency: The transcription is generated by the exact same model instance handling the conversation, eliminating semantic drift between separate endpoints like Whisper or gpt-4o-transcribe.
- Context awareness: You can optionally inject prior conversation turns into the transcription request, allowing the model to resolve ambiguities using established terminology or speaker names.
- Simplified client logic: All processing occurs over the existing WebSocket connection, with no additional API endpoints, authentication flows, or client-side audio buffering required.
According to the source code in examples/Realtime_out_of_band_transcription.ipynb, this pattern is particularly valuable when you need accurate transcripts for logging or downstream processing without polluting the model's context window with transcription artifacts.
Architecture and Implementation Flow
The implementation follows a specific sequence to isolate the transcription from the conversational stream.
1. Establish the Realtime Session
First, create a persistent WebSocket connection that will handle both the conversational flow and the out-of-band requests.
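from openai import AsyncOpenAI
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
# Assumes a recent openai-python release; older releases expose this as client.beta.realtime.connect(...).
# Substitute the exact realtime model ID available to your account.
async with client.realtime.connect(model="gpt-4o-realtime") as connection:
    ...  # steps 2-4 run inside this block, within an async function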
2. Stream User Audio
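Send audio to the model by appending base64-encoded chunks to the input audio buffer (input_audio_buffer.append events). By default the Realtime API expects raw audio such as 16-bit PCM rather than a container format like WebM, so decode the recording first. The model processes this audio and prepares its conversational response, updating the session state normally.
import base64
# Raw 16-bit PCM audio; the file name is illustrative.
with open("user_speech.pcm", "rb") as f:
    audio_bytes = f.read()
await connection.input_audio_buffer.append(
    audio=base64.b64encode(audio_bytes).decode("ascii")
)
# Only needed when server-side voice activity detection is turned off.
await connection.input_audio_buffer.commit()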
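For longer recordings or live capture, audio is usually sent in small chunks rather than as one payload. The sketch below builds the raw `input_audio_buffer.append` event JSON for each chunk. The helper name `audio_append_events` and the 100 ms chunk size are assumptions for illustration, and the byte arithmetic assumes 16-bit mono PCM at 24 kHz.

```python
import base64
import json


def audio_append_events(pcm_bytes: bytes, chunk_size: int = 4800):
    """Yield JSON strings for input_audio_buffer.append events.

    chunk_size is in bytes: 4800 bytes is 100 ms of 16-bit mono PCM at
    24 kHz (24000 samples/s * 2 bytes/sample * 0.1 s).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# 12000 bytes of dummy PCM split into 4800-byte chunks.
events = list(audio_append_events(b"\x00\x01" * 6000, chunk_size=4800))
print(len(events))
```

Keeping the chunk size even avoids splitting a 16-bit sample across two events.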
3. Trigger the Out-of-Band Request
Initiate a second response.create call with the critical "conversation": "none" flag. As implemented in the cookbook at line 598 of Realtime_out_of_band_transcription.ipynb, this flag prevents the server from mutating the active conversation state.
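# The model is fixed when the session is opened, so it is not repeated here.
# Field names follow the GA Realtime interface; beta sessions use "modalities" instead of "output_modalities".
transcription_payload = {
    "conversation": "none",  # Out-of-band flag
    "output_modalities": ["text"],  # return text only, no audio
    "instructions": (
        "You are a transcription assistant. Return the verbatim transcript "
        "of the supplied user audio."
    ),
    "metadata": {"purpose": "transcription"},  # illustrative tag for routing events client-side
}
await connection.response.create(response=transcription_payload)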
The instructions field functions as a system prompt that overrides the conversational persona, directing the model to produce pure transcription rather than dialogue.
4. Capture the Transcript
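Listen for text delta events (response.output_text.delta in the GA interface) and accumulate them until response.done arrives. Because the request was out-of-band, the conversation buffer remains unchanged, but you receive the high-fidelity transcript through the same event stream.
transcript_parts = []
async for event in connection:
    # Beta sessions emit "response.text.delta" instead.
    if event.type == "response.output_text.delta":
        transcript_parts.append(event.delta)
    elif event.type == "response.done":
        # Assumes the conversational response has already finished.
        break
print("Transcript:", "".join(transcript_parts))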
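When conversational and out-of-band responses share one event stream, the client needs a way to tell them apart. One possible approach, sketched here over plain dictionaries, is to tag the out-of-band request with metadata and match events by response ID. The `{"purpose": "transcription"}` tag and the helper name `collect_transcript` are assumptions, and the event shapes follow the GA interface as I understand it (`response.created`, `response.output_text.delta`, `response.done`).

```python
def collect_transcript(events, purpose="transcription"):
    """Join text deltas that belong to the response tagged with `purpose`."""
    target_id = None
    parts = []
    for event in events:
        etype = event.get("type")
        if etype == "response.created":
            # Remember the ID of the response carrying our metadata tag.
            metadata = event["response"].get("metadata") or {}
            if metadata.get("purpose") == purpose:
                target_id = event["response"]["id"]
        elif etype == "response.output_text.delta":
            # Ignore deltas from any other response.
            if target_id is not None and event.get("response_id") == target_id:
                parts.append(event["delta"])
        elif etype == "response.done":
            if target_id is not None and event["response"]["id"] == target_id:
                break
    return "".join(parts)


sample_events = [
    {"type": "response.created", "response": {"id": "resp_chat", "metadata": None}},
    {"type": "response.output_text.delta", "response_id": "resp_chat", "delta": "Happy to help."},
    {"type": "response.done", "response": {"id": "resp_chat"}},
    {"type": "response.created",
     "response": {"id": "resp_oob", "metadata": {"purpose": "transcription"}}},
    {"type": "response.output_text.delta", "response_id": "resp_oob", "delta": "where is "},
    {"type": "response.output_text.delta", "response_id": "resp_oob", "delta": "my order"},
    {"type": "response.done", "response": {"id": "resp_oob"}},
]
print(collect_transcript(sample_events))  # where is my order
```

With SDK event objects rather than dictionaries, the same logic applies using attribute access (`event.type`, `event.delta`).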
Optional Context Injection
While out-of-band transcription isolates the response from history, you can still leverage conversational context for accuracy. The input field of the response.create request accepts a chosen subset of prior turns, either as full items or as references to existing conversation items, allowing the model to resolve ambiguous pronunciations or domain-specific terminology using established context.
As noted in the cookbook at line 1428, including context increases token costs because the model re-processes the supplied history. Selective inclusion, such as providing only the last two turns instead of the full session, lets you balance transcription accuracy against compute expenses.
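Selective context can be sketched as a small helper that references only the most recent conversation items. This assumes the Realtime `response.create` event accepts an `input` array containing `item_reference` entries, and that supplying `input` replaces the default context for that response (so the latest user audio item must be among the references). The helper name and the item IDs are hypothetical.

```python
def build_context_input(item_ids, last_n=2):
    """Reference only the most recent `last_n` conversation items."""
    if last_n < 0:
        raise ValueError("last_n must be non-negative")
    # item_ids[-0:] would return the whole list, so handle zero explicitly.
    recent = item_ids[-last_n:] if last_n else []
    return [{"type": "item_reference", "id": item_id} for item_id in recent]


request = {
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": "Return a verbatim transcript of the latest user audio.",
        # Hypothetical item IDs; real IDs come from conversation item events.
        "input": build_context_input(["item_a", "item_b", "item_c"], last_n=2),
    },
}
print(request["response"]["input"])
```

Raising `last_n` gives the model more terminology to draw on at the price of more input tokens per transcription.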
Cost Considerations
Out-of-band transcription incurs higher compute costs than dedicated transcription endpoints like Whisper. Because the Realtime model recomputes the entire supplied context (including any optional message history you include) to generate the transcript, you pay for the full contextual processing rather than a streamlined audio-to-text conversion.
The notebook at examples/Realtime_out_of_band_transcription.ipynb provides a detailed cost breakdown at line 1428, demonstrating how context window selection directly impacts pricing.
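The effect of context size on cost is simple arithmetic once token counts are known. The sketch below compares a short context with a full-session context. All rates and token counts are made-up placeholders chosen for easy arithmetic, not OpenAI prices or figures from the notebook; substitute current pricing and the usage reported in your own `response.done` events.

```python
def estimate_cost_usd(tokens, rates_per_million):
    """Sum cost across token categories (rates are USD per million tokens)."""
    return sum(
        count * rates_per_million[kind] / 1_000_000
        for kind, count in tokens.items()
    )


# Placeholder rates, NOT real pricing.
rates = {"audio_in": 10.0, "text_in": 1.0, "text_out": 2.0}

# Placeholder token counts for the same transcript with different context.
last_two_turns = {"audio_in": 600, "text_in": 200, "text_out": 100}
full_session = {"audio_in": 6000, "text_in": 2000, "text_out": 100}

recent_cost = estimate_cost_usd(last_two_turns, rates)
full_cost = estimate_cost_usd(full_session, rates)
print(f"last two turns: ${recent_cost:.4f}, full session: ${full_cost:.4f}")
```

The output tokens are identical in both cases; the difference comes entirely from the input context that the model re-processes.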
Key Files in the Repository
The OpenAI Cookbook contains multiple reference implementations of this pattern:
| File | Purpose |
|---|---|
| examples/Realtime_out_of_band_transcription.ipynb | Complete notebook walkthrough demonstrating the technique, system prompts, and cost analysis. |
| registry.yaml | Registers the example under the entry realtime-out-of-band-transcription for the cookbook UI. |
| examples/evals/realtime_evals/shared/realtime_harness_utils.py | Contains low-level helper functions that perform await connection.response.create(...) calls, with line 156 showing the implementation pattern and line 332 demonstrating bare response.create() usage. |
Summary
- Use "conversation": "none" in response.create to generate transcripts without polluting chat history.
- Reuse the same model instance (gpt-4o-realtime) for both dialogue and transcription to ensure consistency.
- Supply a transcription-focused system prompt via the instructions field to override conversational behavior.
- Monitor context costs carefully; out-of-band transcription is more expensive than dedicated endpoints due to full context recomputation.
- Reference the cookbook notebook at examples/Realtime_out_of_band_transcription.ipynb for production-ready implementations.
Frequently Asked Questions
What is the difference between default Realtime transcription and out-of-band transcription?
By default, every response the model generates, including any transcript you ask it to produce, is appended to the conversation history and becomes context for later turns. Out-of-band transcription uses "conversation": "none" to process audio and return text without mutating the session state, keeping the chat history clean while still using the same underlying model instance.
How do I prevent the transcription from appearing in the chat history?
Set the request-level flag "conversation": "none" in your response.create payload. According to the source code in examples/Realtime_out_of_band_transcription.ipynb at line 598, this flag explicitly instructs the server not to write the response back into the active conversation buffer.
Can I use context from previous turns in out-of-band transcription?
Yes. Include an input array in your transcription request containing the relevant prior turns, either as full items or as references to existing conversation items. This allows the model to resolve ambiguities using established context, though as documented at line 1428 of the cookbook notebook, this increases costs because the model re-processes the supplied history.
Why is out-of-band transcription more expensive than using Whisper?
The Realtime model recomputes the entire supplied context—including any conversation history you provide—to generate the transcript, whereas Whisper performs a streamlined audio-to-text conversion without context awareness. This full contextual processing consumes more tokens, making out-of-band transcription costlier per minute of audio when context is included.