# How to Implement Out-of-Band Transcription with OpenAI Realtime Sessions

> Learn to implement out-of-band transcription with OpenAI Realtime sessions. Generate verbatim transcripts without altering conversation state by setting `"conversation": "none"` in `response.create`.

- Repository: [OpenAI/openai-cookbook](https://github.com/openai/openai-cookbook)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Out-of-band transcription lets you generate verbatim transcripts from Realtime session audio without mutating the active conversation state by setting `"conversation": "none"` in a secondary `response.create` request.**

The OpenAI Cookbook demonstrates how to use the Realtime API for both conversation and transcription in a single session. With out-of-band transcription, you extract high-fidelity text from user audio using the same `gpt-4o-realtime` model that handles your dialogue. The transcript then reflects what the conversational model actually received, rather than the output of a separate transcription service, and the chat history stays untouched.

## Understanding Out-of-Band Transcription

By default, the Realtime API operates in `conversation: "auto"` mode, where every model response is appended to the active chat history. **Out-of-band transcription** breaks this pattern by issuing a second `response.create` call on the same WebSocket with `"conversation": "none"`, instructing the server to process the audio and return text without writing the response back into the conversation buffer.
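
To make the contrast concrete, here is a minimal sketch of the two requests as raw JSON events. The helper name `build_response_create` is hypothetical, and the payload shape assumes the Realtime `response.create` client event, which nests per-response options under a `response` key.

```python
import json
from typing import Optional


def build_response_create(out_of_band: bool, instructions: Optional[str] = None) -> dict:
    """Build a `response.create` client event (hypothetical helper)."""
    response: dict = {}
    if out_of_band:
        # "none" keeps the generated output out of the default conversation.
        response["conversation"] = "none"
    if instructions is not None:
        response["instructions"] = instructions
    return {"type": "response.create", "response": response}


# Default behaviour: the output is appended to the conversation.
default_event = build_response_create(out_of_band=False)

# Out-of-band: same socket, same session, but the history is left alone.
oob_event = build_response_create(
    out_of_band=True,
    instructions="Return the verbatim transcript of the latest user audio.",
)

print(json.dumps(oob_event, indent=2))
```

The only structural difference between the two events is the `conversation` field, which is why the technique fits into an existing session without extra setup.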

This technique provides three distinct advantages:

- **Single-session consistency**: The transcription is generated by the exact same model instance handling the conversation, eliminating semantic drift between separate endpoints like Whisper or `gpt-4o-transcribe`.
- **Context awareness**: You can optionally inject prior conversation turns into the transcription request, allowing the model to resolve ambiguities using established terminology or speaker names.
- **Simplified client logic**: All processing occurs over the existing WebSocket connection—no additional API endpoints, authentication flows, or client-side audio buffering required.

According to the source code in `examples/Realtime_out_of_band_transcription.ipynb`, this pattern is particularly valuable when you need accurate transcripts for logging or downstream processing without polluting the model's context window with transcription artifacts.

## Architecture and Implementation Flow

The implementation follows a specific sequence to isolate the transcription from the conversational stream.

### 1. Establish the Realtime Session

First, create a persistent WebSocket connection that will handle both the conversational flow and the out-of-band requests.

```python
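from openai import AsyncOpenAI

client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

# Run inside an async function (or a notebook cell with top-level await).
# Older SDK releases expose this as client.beta.realtime.connect(...).
# Substitute the exact Realtime model ID available to you.
async with client.realtime.connect(model="gpt-4o-realtime") as connection:
    ...  # Steps 2-4 run here, while the connection is open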
```
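
If you connect with a plain WebSocket client instead of the SDK, the connection details can be sketched as follows. The helper `realtime_endpoint` is hypothetical, the model ID and API key are placeholders, and the sketch assumes the standard Realtime WebSocket URL with bearer-token authentication.

```python
from urllib.parse import urlencode


def realtime_endpoint(model: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and headers for a Realtime session (hypothetical helper)."""
    # The model is chosen once, at connection time, via the query string.
    url = "wss://api.openai.com/v1/realtime?" + urlencode({"model": model})
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


# Placeholder values; never hard-code a real key.
url, headers = realtime_endpoint("gpt-4o-realtime", api_key="sk-placeholder")
print(url)
```

Because the model is fixed when the socket opens, every later request on that socket, conversational or out-of-band, is served by the same model.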

### 2. Stream User Audio

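Send audio to the model by appending base64-encoded chunks to the input audio buffer with `input_audio_buffer.append`. Once the buffer is committed, the audio becomes a user item in the conversation; the model prepares its conversational response and the session state updates as usual.

```python
import base64

# The Realtime API expects raw audio in the session's input format
# (16-bit PCM, 24 kHz, mono by default), not a container such as WebM.
with open("user_speech.pcm", "rb") as f:
    audio_bytes = f.read()

# Audio travels base64-encoded through the input audio buffer.
await connection.input_audio_buffer.append(
    audio=base64.b64encode(audio_bytes).decode("ascii")
)

# Skip this call when server-side voice activity detection is enabled;
# the server then commits the buffer automatically.
await connection.input_audio_buffer.commit()
```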
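
For longer recordings or live microphone input, audio is usually sent in small chunks rather than one large event. The sketch below shows one way to do that with raw events; the helper `audio_append_events` and the 100 ms chunk size are assumptions, as is the 16-bit mono 24 kHz PCM format.

```python
import base64
from typing import Iterator

# Assumption: 16-bit mono PCM at 24 kHz, so 100 ms of audio is 4800 bytes.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2


def audio_append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[dict]:
    """Yield `input_audio_buffer.append` events for consecutive PCM chunks."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


# One second of silence stands in for real microphone audio.
silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(audio_append_events(silence))
print(len(events), "append events")
```

Each event would then be serialized with `json.dumps` and sent over the socket in order.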

### 3. Trigger the Out-of-Band Request

Initiate a second `response.create` call with the critical `"conversation": "none"` flag. As implemented in the cookbook at line 598 of `Realtime_out_of_band_transcription.ipynb`, this flag prevents the server from mutating the active conversation state.

```python
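# Fields of the `response` object inside a `response.create` event.
# The model is fixed at connection time, so it is not repeated here.
transcription_response = {
    "conversation": "none",  # Out-of-band: keep the output out of the history
    "output_modalities": ["text"],  # Text only (older beta sessions use "modalities")
    "metadata": {"purpose": "transcription"},  # Returned on the response, useful for routing
    "instructions": (
        "You are a transcription assistant. Return the verbatim transcript "
        "of the supplied user audio."
    ),
}

await connection.response.create(response=transcription_response)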
```

The `instructions` field acts as a system prompt for this one response, overriding the session's conversational persona so that the model produces pure transcription rather than dialogue.

### 4. Capture the Transcript

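Listen for `response.output_text.delta` events to collect the transcript as it streams, and stop when `response.done` arrives. Because the request was out-of-band, the conversation buffer remains unchanged, but you receive the high-fidelity transcript through the same event stream.

```python
# Assumes the transcription response is the only one in flight.
transcript_parts = []
async for event in connection:
    # Newer sessions emit `response.output_text.delta`; the beta API used `response.text.delta`.
    if event.type in ("response.output_text.delta", "response.text.delta"):
        transcript_parts.append(event.delta)  # Each delta is a text fragment
    elif event.type == "response.done":
        break  # The response is complete

print("Transcript:", "".join(transcript_parts))
```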
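
When conversational and transcription responses share one event stream, the client needs a way to tell them apart. A minimal sketch, assuming each out-of-band request was tagged with `"metadata": {"purpose": "transcription"}` and that delta events carry a `response_id`: the function name `collect_transcripts` and the sample events are illustrative, not taken from the cookbook.

```python
from collections import defaultdict


def collect_transcripts(events: list[dict]) -> dict[str, str]:
    """Return {response_id: text} for responses tagged as transcription."""
    transcription_ids: set[str] = set()
    parts: dict[str, list[str]] = defaultdict(list)
    finished: dict[str, str] = {}

    for event in events:
        kind = event["type"]
        if kind == "response.created":
            response = event["response"]
            metadata = response.get("metadata") or {}
            if metadata.get("purpose") == "transcription":
                transcription_ids.add(response["id"])
        elif kind == "response.output_text.delta":
            # Ignore text that belongs to the conversational response.
            if event["response_id"] in transcription_ids:
                parts[event["response_id"]].append(event["delta"])
        elif kind == "response.done":
            response_id = event["response"]["id"]
            if response_id in transcription_ids:
                finished[response_id] = "".join(parts[response_id])
    return finished


# Hand-written sample events: a chat response interleaved with a transcription.
sample_events = [
    {"type": "response.created", "response": {"id": "resp_chat", "metadata": None}},
    {"type": "response.created",
     "response": {"id": "resp_tx", "metadata": {"purpose": "transcription"}}},
    {"type": "response.output_text.delta", "response_id": "resp_tx", "delta": "Hello, "},
    {"type": "response.output_text.delta", "response_id": "resp_chat", "delta": "Hi there!"},
    {"type": "response.output_text.delta", "response_id": "resp_tx", "delta": "world."},
    {"type": "response.done", "response": {"id": "resp_chat"}},
    {"type": "response.done", "response": {"id": "resp_tx"}},
]

transcripts = collect_transcripts(sample_events)
print(transcripts)
```

Keying on the response ID rather than arrival order keeps the transcript intact even if fragments from different responses interleave.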

## Optional Context Injection

Although an out-of-band response stays out of the history, it can still draw on conversational context for accuracy. The `input` array in the transcription request sets the context for that response: pass a subset of prior turns, for example as references to existing conversation items, so the model can resolve ambiguous pronunciations or domain-specific terminology using established context.

As noted in the cookbook at line 1428, including context increases token costs because the model re-processes the supplied history. Selective inclusion—such as providing only the last two turns versus the full session—allows you to balance transcription accuracy against compute expenses.
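
Selective inclusion can be sketched as a request that references only the most recent conversation items. This assumes the `input` array accepts `item_reference` entries pointing at items already in the conversation; the helper `build_contextual_request` and the item IDs are hypothetical.

```python
def build_contextual_request(item_ids: list[str], max_items: int, instructions: str) -> dict:
    """Build an out-of-band request whose context is the last `max_items` items."""
    # Guard against max_items == 0: a slice of [-0:] would return every item.
    recent = item_ids[-max_items:] if max_items > 0 else []
    return {
        "type": "response.create",
        "response": {
            "conversation": "none",
            "output_modalities": ["text"],
            "instructions": instructions,
            # Only these items form the context for this response.
            "input": [{"type": "item_reference", "id": item_id} for item_id in recent],
        },
    }


# Hypothetical item IDs, oldest first; the last one holds the audio to transcribe.
history = ["item_1", "item_2", "item_3", "item_4"]
request = build_contextual_request(
    history,
    max_items=2,
    instructions="Return the verbatim transcript of the latest user audio.",
)
print([entry["id"] for entry in request["response"]["input"]])
```

The item containing the audio you want transcribed has to be among the referenced items; shrinking `max_items` trades context for lower token usage.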

## Cost Considerations

Out-of-band transcription incurs **higher compute costs** than dedicated transcription endpoints like Whisper. Because the Realtime model recomputes the entire supplied context (including any optional message history you include) to generate the transcript, you pay for the full contextual processing rather than a streamlined audio-to-text conversion.

The notebook at `examples/Realtime_out_of_band_transcription.ipynb` provides a detailed cost breakdown at line 1428, demonstrating how context window selection directly impacts pricing.
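
The effect of context size on cost comes down to simple arithmetic over token counts. The sketch below uses placeholder prices and made-up token counts purely for illustration; they are not actual OpenAI pricing or figures from the notebook, and the `Pricing` class and `estimate_cost` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per one million tokens (placeholder values only)."""
    audio_input: float
    text_input: float
    text_output: float


def estimate_cost(pricing: Pricing, audio_in: int, text_in: int, text_out: int) -> float:
    """Estimate the cost of one out-of-band response from its token counts."""
    total = (
        audio_in * pricing.audio_input
        + text_in * pricing.text_input
        + text_out * pricing.text_output
    )
    return total / 1_000_000


# Placeholder prices; check the current pricing page for real numbers.
placeholder = Pricing(audio_input=10.0, text_input=1.0, text_output=2.0)

# Made-up token counts: same transcript length, different amounts of context.
last_turn_only = estimate_cost(placeholder, audio_in=500, text_in=200, text_out=60)
full_session = estimate_cost(placeholder, audio_in=5_000, text_in=2_000, text_out=60)

print(f"last turn only: ${last_turn_only:.5f}")
print(f"full session:   ${full_session:.5f}")
```

With the output length held constant, the difference between the two estimates comes entirely from the input side, which is the part that context selection controls.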

## Key Files in the Repository

The OpenAI Cookbook contains multiple reference implementations of this pattern:

| File | Purpose |
|------|---------|
| `examples/Realtime_out_of_band_transcription.ipynb` | Complete notebook walkthrough demonstrating the technique, system prompts, and cost analysis. |
| [`registry.yaml`](https://github.com/openai/openai-cookbook/blob/main/registry.yaml) | Registers the example under the entry `realtime-out-of-band-transcription` for the cookbook UI. |
| [`examples/evals/realtime_evals/shared/realtime_harness_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_harness_utils.py) | Contains low-level helper functions that perform `await connection.response.create(...)` calls, with line 156 showing the implementation pattern and line 332 demonstrating bare `response.create()` usage. |

## Summary

- **Use `"conversation": "none"`** in `response.create` to generate transcripts without polluting chat history.
- **Reuse the same model instance** (`gpt-4o-realtime`) for both dialogue and transcription to ensure consistency.
- **Supply a transcription-focused system prompt** via the `instructions` field to override conversational behavior.
- **Monitor context costs** carefully; out-of-band transcription is more expensive than dedicated endpoints due to full context recomputation.
- **Reference the cookbook notebook** at `examples/Realtime_out_of_band_transcription.ipynb` for the complete walkthrough.

## Frequently Asked Questions

### What is the difference between default Realtime transcription and out-of-band transcription?

By default, every response the model generates is appended to the conversation history, so a transcript requested that way would become part of the dialogue. **Out-of-band transcription** uses `"conversation": "none"` to process audio and return text without mutating the session state, keeping the chat history clean while still using the same underlying model instance.

### How do I prevent the transcription from appearing in the chat history?

Set the request-level flag `"conversation": "none"` in your `response.create` payload. According to the source code in `examples/Realtime_out_of_band_transcription.ipynb` at line 598, this flag explicitly instructs the server not to write the response back into the active conversation buffer.

### Can I use context from previous turns in out-of-band transcription?

Yes. Include an `input` array in your transcription request that contains, or references, the relevant prior turns. This allows the model to resolve ambiguities using established context, though as documented at line 1428 of the cookbook notebook, this increases costs because the model re-processes the supplied history.

### Why is out-of-band transcription more expensive than using Whisper?

The Realtime model recomputes the entire supplied context—including any conversation history you provide—to generate the transcript, whereas Whisper performs a streamlined audio-to-text conversion without context awareness. This full contextual processing consumes more tokens, making out-of-band transcription costlier per minute of audio when context is included.