# Building Voice Agents with the OpenAI Realtime API: Architecture and Code Examples

> Learn to build voice agents with the OpenAI Realtime API by exploring a production-ready reference implementation that combines ESP-32, Deno, and React, with code examples throughout.

- Repository: [OpenAI/openai-cookbook](https://github.com/openai/openai-cookbook)
- Tags: how-to-guide
- Published: 2026-03-02

---

**The OpenAI Cookbook provides production-ready reference implementations for building voice agents using the Realtime API, featuring ESP-32 hardware integration, Deno edge runtime functions, and React frontends with server-side voice activity detection.**

The `openai/openai-cookbook` repository contains a comprehensive suite of voice-agent examples demonstrating end-to-end speech-to-speech pipelines. These implementations showcase how to stream audio from IoT devices through edge servers to the Realtime API, handling authentication, turn detection, and multi-language translation in real time. This guide examines the architecture, core concepts, and production code patterns found in the official examples.

## Architecture Overview

The reference implementation spans four distinct layers, each handling specific responsibilities in the audio pipeline from hardware capture to model inference.

### IoT Device Layer (ESP-32)

At the edge, **ESP-32 microcontrollers** capture microphone audio using the Opus codec and transmit streams over secure WebSockets. The firmware handles device authentication and audio compression before transmission to the edge server. Configuration endpoints are defined in the Arduino source files within `examples/voice_solutions/arduino_ai_speech_assets/`.

### Edge Server Layer (Deno/Supabase)

A **Deno-based edge function** running on Supabase Edge Runtime authenticates devices and proxies audio to OpenAI's servers. This layer, documented in [`examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md), manages environment variables including `OPENAI_API_KEY` and `SUPABASE_ANON_KEY`, while implementing **Voice Activity Detection (VAD)** turn detection and optional encryption.
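As an illustration of the device-authentication step, the sketch below pulls a token out of an `Authorization: Bearer …` header before any audio is proxied. The header scheme and the `extractBearerToken` helper are assumptions made for this example; they are not taken from the cookbook source, and verifying the token itself is left out.

```typescript
// Hypothetical helper, not part of the cookbook source.
// Returns the token from an "Authorization: Bearer <token>" header value,
// or null when the header is missing or uses another scheme.
function extractBearerToken(header: string | null | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}
```

An edge function following this pattern would reject the WebSocket upgrade when the helper returns `null`, and only then open the upstream connection with `OPENAI_API_KEY`.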

### Frontend Layer (React/Next.js)

The React application in `examples/voice_solutions/one_way_translation_using_realtime_api/` provides the configuration interface and audio streaming UI. It uses the `@openai/realtime-api-beta` client library to hold one Realtime API connection per target language, as implemented in [`src/pages/SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx), and relays the translated output to listeners over Socket.io.

### Evaluation Layer (Python)

For testing and CI integration, the Python evaluation harness in [`examples/evals/realtime_evals/shared/realtime_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_utils.py) provides deterministic replay tests. The `ToolCallAccumulator` class captures and validates function calls within Realtime API event streams.

## Core Realtime API Concepts

Understanding these fundamental patterns is essential for building responsive voice agents with low-latency audio streaming.

### RealtimeClient Initialization

The **RealtimeClient** class wraps the WebSocket protocol, handling connection management and session configuration. Each client instance maintains an independent audio stream and instruction set. The cookbook demonstrates creating multiple clients for simultaneous translation scenarios.

### Server-Side Voice Activity Detection (VAD)

The Realtime API supports **server-side VAD** via the `turn_detection` parameter. When enabled with `type: 'server_vad'`, the API automatically detects speech boundaries, eliminating the need for manual recording controls. The `changeTurnEndType` function in the React examples toggles this behavior via `client.updateSession({ turn_detection: … })`.
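The toggle can be sketched as a small function that maps a UI choice to the `turn_detection` value. This is a minimal sketch, not the cookbook's code: the `SessionUpdater` interface is a hypothetical stand-in for the one `RealtimeClient` method used here, and the `"none"` mode name is an assumption.

```typescript
// Only the session field touched here; the real session object has more.
type TurnDetection = { type: "server_vad" } | null;

// Hypothetical stand-in for the single client method this sketch needs.
interface SessionUpdater {
  updateSession(update: { turn_detection: TurnDetection }): void;
}

// "none" means manual turns; "server_vad" lets the API find speech boundaries.
type TurnEndType = "none" | "server_vad";

function turnDetectionFor(mode: TurnEndType): TurnDetection {
  return mode === "server_vad" ? { type: "server_vad" } : null;
}

function changeTurnEndType(client: SessionUpdater, mode: TurnEndType): void {
  client.updateSession({ turn_detection: turnDetectionFor(mode) });
}
```

Keeping the mapping in its own pure function makes it easy to unit test without opening a connection.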

### Audio Streaming and Session Management

Audio flows as **PCM frames** through `client.appendInputAudio()`, while `client.createResponse()` signals the model to process the accumulated audio and generate a response. System instructions are injected per client using `client.updateSession({ instructions })`, allowing language-specific prompting as defined in [`translation_prompts.js`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js).
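Browser audio APIs usually produce floating-point samples, so a conversion step is needed before frames can be appended as PCM. The sketch below assumes the session uses the 16-bit `pcm16` input format; `floatTo16BitPCM` is a name chosen for this example, and resampling to the session's sample rate is not covered.

```typescript
// Convert Web Audio style samples (floats in [-1, 1]) to signed 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative and positive halves of the int16 range differ by one step.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting `Int16Array` is the kind of chunk that would be passed to `client.appendInputAudio()`.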

## Implementation Examples

The following patterns demonstrate practical implementations from the cookbook source code.

### Initializing the Realtime Client

Create a WebSocket connection and configure the session voice and model parameters:

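```typescript
import { RealtimeClient } from '@openai/realtime-api-beta';

// Read the key from the build-time environment rather than hard-coding it.
export const OPENAI_API_KEY = process.env.REACT_APP_OPENAI_API_KEY;

const client = new RealtimeClient({
  apiKey: OPENAI_API_KEY,
  // This flag exposes the key to the browser; suitable for local demos only.
  dangerouslyAllowAPIKeyInBrowser: true,
});

// Open the WebSocket against a specific model snapshot, then choose the output voice.
await client.realtime.connect({ model: "gpt-4o-realtime-preview-2024-12-17" });
await client.updateSession({ voice: "coral" });
```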

*Source:* [`examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx)

### Broadcasting Audio to Multiple Clients

Stream microphone input to all active language clients simultaneously:

```typescript
// Each recorded chunk is fanned out to every language's client.
await wavRecorder.record((data) => {
  updatedLanguageConfigs.forEach(({ clientRef }) => {
    clientRef.current.appendInputAudio(data.mono);
  });
});
```

*Source:* [`examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx)
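When several clients share one microphone stream, a dropped or failing connection should not stall the others. The following is a minimal sketch of a more defensive fan-out, assuming each client exposes `isConnected()` and `appendInputAudio()`; the `AudioSink` interface and `broadcastChunk` function are hypothetical names for this example.

```typescript
// Hypothetical minimal view of a client: only the two methods used here.
interface AudioSink {
  isConnected(): boolean;
  appendInputAudio(chunk: Int16Array): void;
}

// Sends one chunk to every connected sink and returns how many accepted it.
function broadcastChunk(sinks: AudioSink[], chunk: Int16Array): number {
  let delivered = 0;
  for (const sink of sinks) {
    if (!sink.isConnected()) continue; // skip clients that have dropped
    try {
      sink.appendInputAudio(chunk);
      delivered++;
    } catch {
      // One failing client should not interrupt delivery to the rest.
    }
  }
  return delivered;
}
```

The return value gives the caller a cheap health signal, for example to warn in the UI when fewer clients received audio than are configured.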

### Processing Transcript Events

Handle completion events to extract final transcripts for UI display:

```typescript
// Runs when the full transcript of a generated audio response is available.
if (ev.event.type === "response.audio_transcript.done") {
  setTranscripts(prev => [{ transcript: ev.event.transcript, language: languageCode }, ...prev]);
}
```

*Source:* [`examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx)
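In a long session the transcript list grows without bound if every entry is kept. One option is to move the state update into a pure helper that skips blank transcripts and caps the list. This is a sketch only: `prependTranscript` and the default cap of 50 are assumptions, not values from the cookbook.

```typescript
interface TranscriptEntry {
  transcript: string;
  language: string;
}

// Returns a new array with the entry first, ignoring blanks and capping the length.
function prependTranscript(
  prev: TranscriptEntry[],
  entry: TranscriptEntry,
  maxEntries = 50,
): TranscriptEntry[] {
  if (entry.transcript.trim() === "") return prev;
  return [entry, ...prev].slice(0, maxEntries);
}
```

With this helper, the handler body would become `setTranscripts(prev => prependTranscript(prev, { transcript: ev.event.transcript, language: languageCode }))`.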

### Accumulating Tool Calls in Python

For evaluation and testing, accumulate function calls from response streams:

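```python
from typing import Any, Dict


class ToolCallAccumulator:
    # Excerpt: `_ensure_entry` and the rest of the class live in the source file.
    def handle_event_payload(self, payload: Dict[str, Any]) -> None:
        # Only the final `response.done` event is inspected here.
        if payload.get("type") != "response.done":
            return
        for output_item in payload.get("response", {}).get("output", []):
            if output_item.get("type") == "function_call":
                # Prefer `call_id`; fall back to the item `id` when it is absent.
                call_id = output_item.get("call_id") or output_item.get("id") or ""
                entry = self._ensure_entry(call_id, output_item.get("name", ""))
                entry["raw_arguments"] = output_item.get("arguments", "")
```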

*Source:* [`examples/evals/realtime_evals/shared/realtime_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_utils.py)
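The same extraction can be sketched on the client side in TypeScript, for example to assert on tool calls in frontend tests. This is an illustrative port rather than code from the cookbook: `collectFunctionCalls` and the `CollectedCall` shape are hypothetical, and it additionally tries to parse the arguments string as JSON.

```typescript
interface CollectedCall {
  callId: string;
  name: string;
  rawArguments: string;
  parsedArguments: unknown; // undefined when the arguments are not valid JSON
}

// Loose view of an output item; every field is checked before use.
interface OutputItem {
  type?: unknown;
  call_id?: unknown;
  id?: unknown;
  name?: unknown;
  arguments?: unknown;
}

function collectFunctionCalls(payload: unknown): CollectedCall[] {
  const p = payload as { type?: unknown; response?: { output?: unknown } } | null;
  if (!p || p.type !== "response.done") return [];
  const output = p.response?.output;
  if (!Array.isArray(output)) return [];

  const calls: CollectedCall[] = [];
  for (const item of output as Array<OutputItem | null>) {
    if (!item || item.type !== "function_call") continue;
    const rawArguments = typeof item.arguments === "string" ? item.arguments : "";
    let parsedArguments: unknown;
    try {
      parsedArguments = JSON.parse(rawArguments);
    } catch {
      parsedArguments = undefined; // keep the raw string for debugging
    }
    calls.push({
      callId: String(item.call_id || item.id || ""),
      name: typeof item.name === "string" ? item.name : "",
      rawArguments,
      parsedArguments,
    });
  }
  return calls;
}
```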

## Key Source Files

The following files contain the critical implementation details for building voice agents:

- [`examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md) – Complete walkthrough of the ESP-32 + Deno edge architecture
- [`examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx) – Core React component managing multiple RealtimeClients
- [`examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js) – Language-specific system prompts
- [`examples/evals/realtime_evals/shared/realtime_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_utils.py) – Evaluation utilities and tool call accumulation
- `examples/voice_solutions/arduino_ai_speech_assets/flowchart.png` – Visual architecture diagram

## Extension Patterns

Customize the reference implementation for specific use cases using these established patterns.

### Adding Custom Characters and Languages

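Extend [`translation_prompts.js`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js) with new instruction sets, then add corresponding entries to the `languageConfigs` array in [`SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx). Update [`ListenerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/ListenerPage.tsx) to expose the new options in the UI dropdown.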
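One way to keep the prompt file and the config array in step is to derive the configs from a prompt map and fail fast when a language has no instructions. The sketch below is hypothetical: the `LanguageConfig` fields and `buildLanguageConfigs` are illustrative names, the real `languageConfigs` entries also carry a client reference, and the field names in the cookbook may differ.

```typescript
// Hypothetical shape for illustration; the client reference is omitted.
interface LanguageConfig {
  code: string;
  instructions: string;
}

// Builds one config per enabled language, refusing codes that have no prompt.
function buildLanguageConfigs(
  prompts: Record<string, string>,
  enabledCodes: string[],
): LanguageConfig[] {
  return enabledCodes.map((code) => {
    const instructions = prompts[code];
    if (!instructions) {
      throw new Error(`No instructions registered for language "${code}"`);
    }
    return { code, instructions };
  });
}
```

Failing at startup is a deliberate choice here: a language with an empty instruction string would otherwise connect and respond without translating.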

### Deploying Custom Edge Functions

Copy the [`server-deno/main.ts`](https://github.com/openai/openai-cookbook/blob/main/server-deno/main.ts) pattern to create new edge endpoints. Configure environment variables in `.env`, then start the server with `deno run -A --env-file=.env main.ts`.

### Implementing Client-Side VAD

Replace the built-in `server_vad` with manual turn management by calling `client.updateSession({ turn_detection: null })` and driving the `wavRecorder` pause/resume logic from custom audio-level thresholds.
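A simple form of threshold-based detection compares the RMS level of each chunk against a fixed value and waits for a run of quiet chunks before ending the turn. The sketch below assumes PCM16 input; `rmsLevel`, `EnergyVad`, the 0.02 threshold, and the 10-chunk hangover are all illustrative choices, not values from the cookbook.

```typescript
// Root-mean-square level of a PCM16 chunk, normalised to the range 0..1.
function rmsLevel(chunk: Int16Array): number {
  if (chunk.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) {
    const s = (chunk[i] ?? 0) / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / chunk.length);
}

// Tracks speech state and reports "end" after `hangoverChunks` quiet chunks in a row.
class EnergyVad {
  private speaking = false;
  private quietChunks = 0;
  private readonly threshold: number;
  private readonly hangoverChunks: number;

  constructor(threshold = 0.02, hangoverChunks = 10) {
    this.threshold = threshold;
    this.hangoverChunks = hangoverChunks;
  }

  process(chunk: Int16Array): "start" | "end" | null {
    if (rmsLevel(chunk) >= this.threshold) {
      this.quietChunks = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "start";
      }
      return null;
    }
    if (!this.speaking) return null;
    this.quietChunks += 1;
    if (this.quietChunks >= this.hangoverChunks) {
      this.speaking = false;
      this.quietChunks = 0;
      return "end";
    }
    return null;
  }
}
```

In this scheme the caller would append audio while speech is active and call `client.createResponse()` when `process()` returns `"end"`. The hangover avoids cutting a turn short on brief pauses between words.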

## Summary

- The **OpenAI Cookbook** provides a complete hardware-to-frontend reference for Realtime API voice agents
- **ESP-32 devices** stream Opus-encoded audio through **Deno edge functions** to the Realtime API, with responses returned via WebSocket
- The **RealtimeClient** handles session configuration, audio streaming via `appendInputAudio()`, and event processing for transcripts
- **Server-side VAD** manages turn detection automatically; for manual control, disable `turn_detection` in the session config
- **Python evaluation tools** in [`realtime_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_utils.py) enable deterministic testing of streaming behavior and tool calls

## Frequently Asked Questions

### How does the Realtime API handle turn detection in voice agents?

The Realtime API supports **server-side Voice Activity Detection (VAD)** through the `turn_detection` session parameter. When configured with `type: 'server_vad'`, the API automatically detects speech start and end points, triggering model responses without client-side silence detection. For custom behavior, set `turn_detection: null` and end each turn yourself by calling `client.createResponse()` once the audio for that turn has been appended.

### Can I run the voice agent examples without ESP-32 hardware?

Yes. While the cookbook includes ESP-32 firmware for IoT deployment, the **React frontend** and **Deno edge functions** can operate independently using browser-based microphone input. The [`SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx) component demonstrates web-only audio capture using the `wavRecorder` API, streaming directly to the Realtime API without physical hardware.

### What is the purpose of the ToolCallAccumulator class in the Python evaluation code?

The **ToolCallAccumulator** class in [`examples/evals/realtime_evals/shared/realtime_utils.py`](https://github.com/openai/openai-cookbook/blob/main/examples/evals/realtime_evals/shared/realtime_utils.py) captures and aggregates function calls from Realtime API event streams during testing. It processes `response.done` events to extract `function_call` output items, enabling deterministic validation of agent behavior in CI pipelines and research environments.

### How do I add support for additional languages in the translation example?

Add new language support by creating instruction prompts in [`translation_prompts.js`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js), then registering the language in the `languageConfigs` array within [`SpeakerPage.tsx`](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx). Each entry requires a language code, client reference, and instruction string. The existing architecture automatically creates isolated RealtimeClient instances for each language, broadcasting audio input to all configured clients simultaneously.