Building Voice Agents with the OpenAI Realtime API: Architecture and Code Examples

March 2, 2026 openai/openai-cookbook

The OpenAI Cookbook provides production-ready reference implementations for building voice agents using the Realtime API, featuring ESP-32 hardware integration, Deno edge runtime functions, and React frontends with server-side voice activity detection.

The openai/openai-cookbook repository contains a comprehensive suite of voice-agent examples demonstrating end-to-end speech-to-speech pipelines. These implementations showcase how to stream audio from IoT devices through edge servers to the Realtime API, handling authentication, turn detection, and multi-language translation in real time. This guide examines the architecture, core concepts, and production code patterns found in the official examples.

Architecture Overview

The reference implementation spans four distinct layers, each handling specific responsibilities in the audio pipeline from hardware capture to model inference.

IoT Device Layer (ESP-32)

At the edge, ESP-32 microcontrollers capture microphone audio, compress it with the Opus codec, and transmit the stream over secure WebSockets. The firmware also handles device authentication before audio reaches the edge server. Configuration endpoints are defined in the Arduino source files within examples/voice_solutions/arduino_ai_speech_assets/.

Edge Server Layer (Deno/Supabase)

A Deno-based edge function running on Supabase Edge Runtime authenticates devices and proxies audio to OpenAI's servers. This layer, documented in examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md, reads its configuration from environment variables including OPENAI_API_KEY and SUPABASE_ANON_KEY, and implements turn detection based on Voice Activity Detection (VAD) along with optional encryption.

Frontend Layer (React/Next.js)

The React application in examples/voice_solutions/one_way_translation_using_realtime_api/ provides the configuration interface and audio streaming UI. It utilizes the @openai/realtime-api-beta client library to manage multiple simultaneous language connections via Socket.io, as implemented in src/pages/SpeakerPage.tsx.

Evaluation Layer (Python)

For testing and CI integration, the Python evaluation harness in examples/evals/realtime_evals/shared/realtime_utils.py provides deterministic replay tests. The ToolCallAccumulator class captures and validates function calls within Realtime API event streams.

Core Realtime API Concepts

Understanding these fundamental patterns is essential for building responsive voice agents with low-latency audio streaming.

RealtimeClient Initialization

The RealtimeClient class wraps the WebSocket protocol, handling connection management and session configuration. Each client instance maintains an independent audio stream and instruction set. The cookbook demonstrates creating multiple clients for simultaneous translation scenarios.

Server-Side Voice Activity Detection (VAD)

The Realtime API supports server-side VAD via the turn_detection parameter. When enabled with type: 'server_vad', the API automatically detects speech boundaries, eliminating the need for manual recording controls. The changeTurnEndType function in the React examples toggles this behavior via client.updateSession({ turn_detection: … }).
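As a rough illustration of the toggle described above, the sketch below builds the session fragment for each mode. The `TurnEndType` union and the `turnDetectionFor` helper are hypothetical names, not taken from the cookbook; only `turn_detection`, `'server_vad'`, and the `null` value come from this guide.

```typescript
// Hypothetical names for illustration; only `turn_detection`, "server_vad",
// and `null` come from the guide.
type TurnEndType = "none" | "server_vad";

interface SessionTurnConfig {
  turn_detection: { type: "server_vad" } | null;
}

// Build the fragment that would be passed to client.updateSession(...).
function turnDetectionFor(mode: TurnEndType): SessionTurnConfig {
  return {
    turn_detection: mode === "server_vad" ? { type: "server_vad" } : null,
  };
}

const manualTurns = turnDetectionFor("none");
const automaticTurns = turnDetectionFor("server_vad");
```

Keeping the mapping in one pure function means the UI toggle only has to pass a mode string, and the payload shape can be checked without a live connection.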

Audio Streaming and Session Management

Audio flows as PCM frames through client.appendInputAudio(), while client.createResponse() signals the model to process accumulated audio and generate a response. System instructions are injected per client using client.updateSession({ instructions }), allowing language-specific prompting as defined in translation_prompts.js.
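If the capture layer produces floating-point samples, they typically need converting to 16-bit PCM before being appended. The sketch below shows one common conversion; it assumes samples in the range [-1, 1] and a session that expects 16-bit PCM, and `floatTo16BitPCM` is a hypothetical helper name rather than part of the client library.

```typescript
// Convert floating-point samples in [-1, 1] to 16-bit signed PCM.
// Assumption: the recorder yields Float32 samples and the session expects
// 16-bit PCM; verify both against your recorder and session configuration.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (value) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, value));
    // The int16 range is asymmetric: -32768..32767.
    return clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  });
}

// Silence, full positive scale, full negative scale, and an out-of-range value.
const pcmFrame = floatTo16BitPCM(new Float32Array([0, 1, -1, 2]));
```

The resulting `Int16Array` is the kind of frame that would then be handed to `client.appendInputAudio()`.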

Implementation Examples

The following patterns demonstrate practical implementations from the cookbook source code.

Initializing the Realtime Client

Create a WebSocket connection and configure the session voice and model parameters:

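import { RealtimeClient } from '@openai/realtime-api-beta';

// Read the key from the build-time environment (REACT_APP_ prefix).
export const OPENAI_API_KEY = process.env.REACT_APP_OPENAI_API_KEY;

// dangerouslyAllowAPIKeyInBrowser exposes the key to anyone who loads the page;
// acceptable for a local demo, not for a public deployment.
const client = new RealtimeClient({
  apiKey: OPENAI_API_KEY,
  dangerouslyAllowAPIKeyInBrowser: true,
});

// Open the WebSocket for a specific model snapshot, then choose the output voice.
await client.realtime.connect({ model: "gpt-4o-realtime-preview-2024-12-17" });
await client.updateSession({ voice: "coral" });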

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx

Broadcasting Audio to Multiple Clients

Stream microphone input to all active language clients simultaneously:

await wavRecorder.record((data) => {
  updatedLanguageConfigs.forEach(({ clientRef }) => {
    clientRef.current.appendInputAudio(data.mono);
  });
});

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx
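The fan-out above can be exercised without a live connection. The following is a minimal sketch, assuming a ref may still be empty before its client is created; `AudioSink`, `SinkRef`, `broadcastFrame`, and `CountingSink` are hypothetical names introduced for illustration, and only `appendInputAudio` comes from the cookbook snippet.

```typescript
// Hypothetical minimal shape: anything that accepts PCM audio frames.
interface AudioSink {
  appendInputAudio(frame: Int16Array): void;
}

// Mirrors a React ref: `current` stays null until the client is created.
interface SinkRef {
  current: AudioSink | null;
}

// Send one frame to every initialized sink; return how many received it.
function broadcastFrame(refs: SinkRef[], frame: Int16Array): number {
  let delivered = 0;
  for (const ref of refs) {
    if (ref.current === null) continue; // client not initialized yet
    ref.current.appendInputAudio(frame);
    delivered++;
  }
  return delivered;
}

// Stub sink that only counts the samples it receives.
class CountingSink implements AudioSink {
  samples = 0;
  appendInputAudio(frame: Int16Array): void {
    this.samples += frame.length;
  }
}

const frenchSink = new CountingSink();
const spanishSink = new CountingSink();
const deliveredCount = broadcastFrame(
  [{ current: frenchSink }, { current: null }, { current: spanishSink }],
  new Int16Array(480),
);
```

Skipping empty refs rather than throwing keeps one slow-to-connect language from interrupting audio delivery to the others.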

Processing Transcript Events

Handle completion events to extract final transcripts for UI display:

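// Fires when the model has finished the transcript for a response.
if (ev.event.type === "response.audio_transcript.done") {
  // Prepend so the newest transcript appears first in the UI.
  setTranscripts(prev => [{ transcript: ev.event.transcript, language: languageCode }, ...prev]);
}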

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx
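One way to make this handler easier to test is to move the state update into a pure function. The sketch below assumes a simplified event shape; `ServerEvent`, `TranscriptEntry`, and `reduceTranscripts` are hypothetical names, and real events carry more fields than shown here.

```typescript
interface TranscriptEntry {
  transcript: string;
  language: string;
}

// Hypothetical minimal event shape; real events carry more fields.
interface ServerEvent {
  type: string;
  transcript?: string;
}

// Pure reducer: prepend finished transcripts, ignore everything else.
function reduceTranscripts(
  prev: TranscriptEntry[],
  event: ServerEvent,
  language: string,
): TranscriptEntry[] {
  if (event.type !== "response.audio_transcript.done") return prev;
  if (typeof event.transcript !== "string") return prev;
  return [{ transcript: event.transcript, language }, ...prev];
}

let transcriptLog: TranscriptEntry[] = [];
transcriptLog = reduceTranscripts(transcriptLog, { type: "response.audio_transcript.delta" }, "fr");
transcriptLog = reduceTranscripts(transcriptLog, { type: "response.audio_transcript.done", transcript: "Bonjour" }, "fr");
transcriptLog = reduceTranscripts(transcriptLog, { type: "response.audio_transcript.done", transcript: "Hola" }, "es");
```

In a component, the same function could be passed to the state setter as `setTranscripts(prev => reduceTranscripts(prev, ev.event, languageCode))`.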

Accumulating Tool Calls in Python

For evaluation and testing, accumulate function calls from response streams:

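from typing import Any, Dict


class ToolCallAccumulator:
    # _ensure_entry is defined elsewhere in the class and is not shown in this excerpt.
    def handle_event_payload(self, payload: Dict[str, Any]) -> None:
        # Only completed responses carry the final list of output items.
        if payload.get("type") != "response.done":
            return
        for output_item in payload.get("response", {}).get("output", []):
            if output_item.get("type") == "function_call":
                # Prefer call_id; fall back to the item id, then to an empty string.
                call_id = output_item.get("call_id") or output_item.get("id") or ""
                entry = self._ensure_entry(call_id, output_item.get("name", ""))
                entry["raw_arguments"] = output_item.get("arguments", "")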

Source: examples/evals/realtime_evals/shared/realtime_utils.py

Key Source Files

The following files contain the critical implementation details for building voice agents:

examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md – Complete walkthrough of the ESP-32 + Deno edge architecture
examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx – Core React component managing multiple RealtimeClients
examples/voice_solutions/one_way_translation_using_realtime_api/src/utils/translation_prompts.js – Language-specific system prompts
examples/evals/realtime_evals/shared/realtime_utils.py – Evaluation utilities and tool call accumulation
examples/voice_solutions/arduino_ai_speech_assets/flowchart.png – Visual architecture diagram

Extension Patterns

Customize the reference implementation for specific use cases using these established patterns.

Adding Custom Characters and Languages

Extend translation_prompts.js with new instruction sets, then add corresponding entries to the languageConfigs array in SpeakerPage.tsx. Update ListenerPage.tsx to expose the new options in the UI dropdown.
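The registration step might look roughly like the sketch below. The `LanguageConfig` shape, the `buildLanguageConfigs` helper, and the prompt strings are all hypothetical placeholders; the field names and prompt text in the cookbook may differ.

```typescript
// Hypothetical shape; the field names in the cookbook may differ.
interface LanguageConfig {
  code: string; // language code, e.g. "de"
  instructions: string; // system prompt for this language
  clientRef: { current: unknown }; // filled in once a client connects
}

// Placeholder prompts, not the cookbook's actual prompt text.
const examplePrompts: Record<string, string> = {
  fr: "Translate the speaker's words into French.",
  es: "Translate the speaker's words into Spanish.",
};

// Derive one config entry per prompt so the two lists cannot drift apart.
function buildLanguageConfigs(promptMap: Record<string, string>): LanguageConfig[] {
  return Object.entries(promptMap).map(([code, instructions]) => ({
    code,
    instructions,
    clientRef: { current: null },
  }));
}

// Adding a language is then a single new prompt entry.
const configsWithGerman = buildLanguageConfigs({
  ...examplePrompts,
  de: "Translate the speaker's words into German.",
});
```

Deriving the config array from the prompt map is a design choice, not something the cookbook prescribes; it simply avoids maintaining the language list in two places.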

Deploying Custom Edge Functions

Copy the server-deno/main.ts pattern to create new edge endpoints. Configure environment variables in .env and start the server with deno run -A --env-file=.env main.ts.
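A new endpoint can fail fast when its configuration is incomplete. The following sketch checks for the two variables named earlier in this guide; the `missingEnv` helper and the idea that exactly these two are required are assumptions for illustration.

```typescript
// Variable names taken from this guide; treat the exact set as an assumption.
const REQUIRED_ENV = ["OPENAI_API_KEY", "SUPABASE_ANON_KEY"] as const;

type EnvSource = Record<string, string | undefined>;

// Return the names of required variables that are missing or blank.
function missingEnv(env: EnvSource): string[] {
  return REQUIRED_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// In Deno the source could be Deno.env.toObject(); a literal is used here
// so the sketch runs in any TypeScript runtime.
const missingVars = missingEnv({
  OPENAI_API_KEY: "sk-placeholder",
  SUPABASE_ANON_KEY: "",
});
```

Running such a check once at startup gives a clear error message instead of an authentication failure on the first device connection.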

Implementing Client-Side VAD

Replace the built-in server_vad with manual turn management by setting client.updateSession({ turn_detection: null }) and controlling the wavRecorder pause/resume logic based on custom audio level thresholds.
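A simple form of threshold-based detection is an RMS level gate with a short hangover. The sketch below is one possible approach, not the cookbook's implementation; `rmsLevel`, `LevelGate`, and the threshold and hangover values are hypothetical and would need tuning for a real microphone.

```typescript
// Root-mean-square level of a 16-bit PCM frame, normalized to 0..1.
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  frame.forEach((sample) => {
    const normalized = sample / 32768;
    sumSquares += normalized * normalized;
  });
  return Math.sqrt(sumSquares / frame.length);
}

// Threshold gate with a hangover: a turn ends only after `hangoverFrames`
// consecutive quiet frames. Default values are illustrative.
class LevelGate {
  private quietFrames = 0;
  private speaking = false;
  private readonly threshold: number;
  private readonly hangoverFrames: number;

  constructor(threshold = 0.02, hangoverFrames = 3) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Returns "start" or "end" on a transition, otherwise null.
  push(frame: Int16Array): "start" | "end" | null {
    if (rmsLevel(frame) >= this.threshold) {
      this.quietFrames = 0;
      if (this.speaking) return null;
      this.speaking = true;
      return "start";
    }
    if (!this.speaking) return null;
    this.quietFrames++;
    if (this.quietFrames < this.hangoverFrames) return null;
    this.speaking = false;
    this.quietFrames = 0;
    return "end";
  }
}

const gate = new LevelGate(0.02, 2);
const loudFrame = new Int16Array(160).fill(8000);
const quietFrame = new Int16Array(160);
const transitions = [loudFrame, loudFrame, quietFrame, quietFrame, quietFrame]
  .map((frame) => gate.push(frame));
```

With `turn_detection` disabled, an "end" transition would be the natural point to call `client.createResponse()`, and a "start" transition the point to resume appending audio.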

Summary

The OpenAI Cookbook provides a complete hardware-to-frontend reference for Realtime API voice agents
ESP-32 devices stream Opus-encoded audio through Deno edge functions to the Realtime API, with responses returned via WebSocket
The RealtimeClient handles session configuration, audio streaming via appendInputAudio(), and event processing for transcripts
Server-side VAD manages turn detection automatically; for manual control, disable turn_detection in the session config
Python evaluation tools in realtime_utils.py enable deterministic testing of streaming behavior and tool calls

Frequently Asked Questions

How does the Realtime API handle turn detection in voice agents?

The Realtime API supports server-side Voice Activity Detection (VAD) through the turn_detection session parameter. When configured with type: 'server_vad', the API automatically detects speech start and end points, triggering model responses without client-side silence detection. For custom behavior, set turn_detection: null, decide on the client when a turn has ended, and trigger each response explicitly with client.createResponse().

Can I run the voice agent examples without ESP-32 hardware?

Yes. While the cookbook includes ESP-32 firmware for IoT deployment, the React frontend and Deno edge functions can operate independently using browser-based microphone input. The SpeakerPage.tsx component demonstrates web-only audio capture using the wavRecorder API, streaming directly to the Realtime API without physical hardware.

What is the purpose of the ToolCallAccumulator class in the Python evaluation code?

The ToolCallAccumulator class in examples/evals/realtime_evals/shared/realtime_utils.py captures and aggregates function calls from Realtime API event streams during testing. It processes response.done events to extract function_call output items, enabling deterministic validation of agent behavior in CI pipelines and research environments.

How do I add support for additional languages in the translation example?

Add new language support by creating instruction prompts in translation_prompts.js, then registering the language in the languageConfigs array within SpeakerPage.tsx. Each entry requires a language code, client reference, and instruction string. The existing architecture automatically creates isolated RealtimeClient instances for each language, broadcasting audio input to all configured clients simultaneously.
