Building Voice Agents with the OpenAI Realtime API: Architecture and Code Examples

The OpenAI Cookbook provides production-ready reference implementations for building voice agents using the Realtime API, featuring ESP-32 hardware integration, Deno edge runtime functions, and React frontends with server-side voice activity detection.

The openai/openai-cookbook repository contains a comprehensive suite of voice-agent examples demonstrating end-to-end speech-to-speech pipelines. These implementations showcase how to stream audio from IoT devices through edge servers to the Realtime API, handling authentication, turn detection, and multi-language translation in real time. This guide examines the architecture, core concepts, and production code patterns found in the official examples.

Architecture Overview

The reference implementation spans four distinct layers, each handling specific responsibilities in the audio pipeline from hardware capture to model inference.

IoT Device Layer (ESP-32)

On the device, ESP-32 microcontrollers capture microphone audio, compress it with the Opus codec, and stream it over secure WebSockets. The firmware also handles device authentication before any audio is sent to the edge server. Configuration endpoints are defined in the Arduino source files within examples/voice_solutions/arduino_ai_speech_assets/.

Edge Server Layer (Deno/Supabase)

A Deno-based edge function running on Supabase Edge Runtime authenticates devices and proxies audio to OpenAI's servers. This layer, documented in examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md, manages environment variables including OPENAI_API_KEY and SUPABASE_ANON_KEY, while implementing Voice Activity Detection (VAD) turn detection and optional encryption.

Frontend Layer (React/Next.js)

The React application in examples/voice_solutions/one_way_translation_using_realtime_api/ provides the configuration interface and audio streaming UI. It utilizes the @openai/realtime-api-beta client library to manage multiple simultaneous language connections via Socket.io, as implemented in src/pages/SpeakerPage.tsx.

Evaluation Layer (Python)

For testing and CI integration, the Python evaluation harness in examples/evals/realtime_evals/shared/realtime_utils.py provides deterministic replay tests. The ToolCallAccumulator class captures and validates function calls within Realtime API event streams.

Core Realtime API Concepts

Understanding these fundamental patterns is essential for building responsive voice agents with low-latency audio streaming.

RealtimeClient Initialization

The RealtimeClient class wraps the WebSocket protocol, handling connection management and session configuration. Each client instance maintains an independent audio stream and instruction set. The cookbook demonstrates creating multiple clients for simultaneous translation scenarios.

Server-Side Voice Activity Detection (VAD)

The Realtime API supports server-side VAD via the turn_detection parameter. When enabled with type: 'server_vad', the API automatically detects speech boundaries, eliminating the need for manual recording controls. The changeTurnEndType function in the React examples toggles this behavior via client.updateSession({ turn_detection: … }).
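The value passed as turn_detection can be sketched as a small pure function. This is a minimal illustration, assuming a UI toggle with two modes; turnDetectionFor is a hypothetical helper and not a cookbook function. The optional tuning fields are server_vad settings in the Realtime API session configuration, and the numbers used here are example values only.

```typescript
// Shape of the turn_detection field sent in a session update.
type ServerVad = {
  type: "server_vad";
  threshold?: number;           // activation threshold, 0.0 to 1.0
  prefix_padding_ms?: number;   // audio kept before detected speech
  silence_duration_ms?: number; // trailing silence that ends a turn
};
type TurnDetection = ServerVad | null;

// Hypothetical helper: maps a UI choice to a turn_detection value.
// "none" means the client decides when a turn ends (push-to-talk style).
function turnDetectionFor(
  mode: "none" | "server_vad",
  tuning: Omit<ServerVad, "type"> = {},
): TurnDetection {
  return mode === "none" ? null : { type: "server_vad", ...tuning };
}

// Example values only; sensible settings depend on the microphone and room.
const manual = turnDetectionFor("none");
const automatic = turnDetectionFor("server_vad", { silence_duration_ms: 500 });
```

Either result would then be handed to client.updateSession({ turn_detection: ... }) on a connected client.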

Audio Streaming and Session Management

Audio flows as PCM frames through client.appendInputAudio(), while client.createResponse() signals the model to process accumulated audio and generate a response. System instructions are injected per client using client.updateSession({ instructions }), allowing language-specific prompting as defined in translation_prompts.js.
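When a capture source yields floating-point samples, as the Web Audio API does, they need converting to 16-bit PCM before being appended. The following is a minimal sketch, assuming the session uses 16-bit mono PCM input; floatTo16BitPCM is an illustrative name, and the cookbook's recorder may already deliver PCM so that no conversion is needed.

```typescript
// Convert float samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  samples.forEach((value, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, value));
    // Negative values scale toward -32768, positive values toward 32767.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
  return out;
}

// Sample input covering silence, half scale, full scale, and clipping.
const pcm = floatTo16BitPCM(new Float32Array([0, -0.5, 1, -1, 2]));
```

The resulting Int16Array is the kind of buffer that would be passed to client.appendInputAudio().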

Implementation Examples

The following patterns demonstrate practical implementations from the cookbook source code.

Initializing the Realtime Client

Create a WebSocket connection and configure the session voice and model parameters:

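import { RealtimeClient } from '@openai/realtime-api-beta';

// Read the key from the build-time environment (REACT_APP_ prefix, Create React App convention).
export const OPENAI_API_KEY = process.env.REACT_APP_OPENAI_API_KEY;

// Caution: this flag exposes the API key to the browser, so treat it as a local-demo setting.
const client = new RealtimeClient({
  apiKey: OPENAI_API_KEY,
  dangerouslyAllowAPIKeyInBrowser: true,
});

// Open the WebSocket for a specific model snapshot, then set the output voice.
await client.realtime.connect({ model: "gpt-4o-realtime-preview-2024-12-17" });
await client.updateSession({ voice: "coral" });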

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx

Broadcasting Audio to Multiple Clients

Stream microphone input to all active language clients simultaneously:

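// Each recorded chunk is forwarded to every configured language client.
await wavRecorder.record((data) => {
  updatedLanguageConfigs.forEach(({ clientRef }) => {
    // data.mono carries the mono channel of the captured chunk.
    clientRef.current.appendInputAudio(data.mono);
  });
});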

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx
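The fan-out above can be isolated into a function that is easy to test without a network connection. This is a sketch under stated assumptions: AudioSink, broadcastChunk, and FakeSink are hypothetical names, and the isConnected() guard is an addition for illustration rather than something the excerpt does.

```typescript
// Minimal surface this sketch needs from a client; a real client has more.
interface AudioSink {
  isConnected(): boolean;
  appendInputAudio(chunk: Int16Array): void;
}

// Send one chunk to every connected sink and report how many received it.
function broadcastChunk(sinks: AudioSink[], chunk: Int16Array): number {
  let delivered = 0;
  for (const sink of sinks) {
    if (!sink.isConnected()) continue; // skip clients that are not ready yet
    sink.appendInputAudio(chunk);
    delivered++;
  }
  return delivered;
}

// Stand-in sink that records chunk sizes, used here instead of a real client.
class FakeSink implements AudioSink {
  received: number[] = [];
  private readonly connected: boolean;
  constructor(connected: boolean) {
    this.connected = connected;
  }
  isConnected(): boolean {
    return this.connected;
  }
  appendInputAudio(chunk: Int16Array): void {
    this.received.push(chunk.length);
  }
}

const spanish = new FakeSink(true);
const french = new FakeSink(false);
const deliveredCount = broadcastChunk([spanish, french], new Int16Array(480));
```

Skipping disconnected sinks keeps one slow or failed language connection from blocking audio delivery to the others.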

Processing Transcript Events

Handle completion events to extract final transcripts for UI display:

// Inside the client's event handler: keep only completed transcripts, newest first.
if (ev.event.type === "response.audio_transcript.done") {
  setTranscripts(prev => [{ transcript: ev.event.transcript, language: languageCode }, ...prev]);
}

Source: examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx
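The state update in that handler can be expressed as a pure function, which makes the newest-first ordering explicit. This is a minimal sketch; TranscriptEntry, ServerEvent, and addTranscript are illustrative names, and the event type lists only the fields the sketch reads.

```typescript
type TranscriptEntry = { transcript: string; language: string };

// Only the fields this sketch reads; real server events carry more.
type ServerEvent = { type: string; transcript?: string };

// Returns a new list with the finished transcript prepended,
// or the same list when the event is not a completed transcript.
function addTranscript(
  prev: TranscriptEntry[],
  event: ServerEvent,
  language: string,
): TranscriptEntry[] {
  if (event.type !== "response.audio_transcript.done" || !event.transcript) {
    return prev;
  }
  return [{ transcript: event.transcript, language }, ...prev];
}

const first = addTranscript([], { type: "response.audio_transcript.done", transcript: "Hola" }, "es");
const second = addTranscript(first, { type: "response.audio_transcript.done", transcript: "Bonjour" }, "fr");
const ignored = addTranscript(second, { type: "response.audio.delta" }, "fr");
```

Returning the previous array unchanged for unrelated events means a React state setter built on this function would not trigger a re-render for them.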

Accumulating Tool Calls in Python

For evaluation and testing, accumulate function calls from response streams:

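from typing import Any, Dict

class ToolCallAccumulator:
    # __init__ and _ensure_entry are assumed sketches; the excerpt omits them.
    def __init__(self) -> None:
        self.calls: Dict[str, Dict[str, Any]] = {}

    def _ensure_entry(self, call_id: str, name: str) -> Dict[str, Any]:
        return self.calls.setdefault(call_id, {"name": name, "raw_arguments": ""})

    def handle_event_payload(self, payload: Dict[str, Any]) -> None:
        if payload.get("type") != "response.done":
            return
        for output_item in payload.get("response", {}).get("output", []):
            if output_item.get("type") == "function_call":
                call_id = output_item.get("call_id") or output_item.get("id") or ""
                entry = self._ensure_entry(call_id, output_item.get("name", ""))
                entry["raw_arguments"] = output_item.get("arguments", "")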

Source: examples/evals/realtime_evals/shared/realtime_utils.py

Key Source Files

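The following files, all referenced in this guide, contain the critical implementation details for building voice agents:

  • examples/voice_solutions/one_way_translation_using_realtime_api/src/pages/SpeakerPage.tsx: client setup, audio broadcast, and transcript handling
  • examples/voice_solutions/running_realtime_api_speech_on_esp32_arduino_edge_runtime_elatoai.md: edge server walkthrough for the ESP-32 pipeline
  • examples/voice_solutions/arduino_ai_speech_assets/: Arduino source files with the configuration endpoints
  • examples/evals/realtime_evals/shared/realtime_utils.py: ToolCallAccumulator and related evaluation utilities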

Extension Patterns

Customize the reference implementation for specific use cases using these established patterns.

Adding Custom Characters and Languages

Extend translation_prompts.js with new instruction sets, then add corresponding entries to the languageConfigs array in SpeakerPage.tsx. Update ListenerPage.tsx to expose the new options in the UI dropdown.
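Registering a language amounts to appending a validated entry to a list. The following is a sketch only: LanguageConfig, its field names, and addLanguage are hypothetical and do not mirror the cookbook's actual types, and the per-language client reference is left out because it depends on React.

```typescript
// Hypothetical entry shape; the real languageConfigs entries may differ.
type LanguageConfig = { code: string; instructions: string };

// Returns a new list with the entry appended, rejecting duplicates
// and empty prompts so misconfiguration fails early.
function addLanguage(configs: LanguageConfig[], entry: LanguageConfig): LanguageConfig[] {
  if (configs.some((c) => c.code === entry.code)) {
    throw new Error(`Duplicate language code: ${entry.code}`);
  }
  if (entry.instructions.trim() === "") {
    throw new Error(`Missing instructions for language: ${entry.code}`);
  }
  return [...configs, entry];
}

// Example prompts are placeholders, not the cookbook's wording.
const baseConfigs: LanguageConfig[] = [
  { code: "fr", instructions: "Translate the speaker's words into French." },
];
const extendedConfigs = addLanguage(baseConfigs, {
  code: "de",
  instructions: "Translate the speaker's words into German.",
});
```

Validating at registration time surfaces a mistyped or repeated language code before any client connection is opened for it.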

Deploying Custom Edge Functions

Copy the server-deno/main.ts pattern to create new edge endpoints. Configure environment variables in .env, then start the server with deno run -A --env-file=.env main.ts.
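A new endpoint benefits from failing fast when a required variable is absent. This is a minimal sketch, assuming the two variable names mentioned earlier in this guide; loadConfig and EnvGetter are hypothetical names. The lookup function is injected so the same code can be exercised in tests, and under Deno it could be called as loadConfig((name) => Deno.env.get(name)).

```typescript
// Any function that looks up an environment variable by name.
type EnvGetter = (name: string) => string | undefined;

type EdgeConfig = { openaiApiKey: string; supabaseAnonKey: string };

// Reads the required variables and reports every missing one in a single error.
function loadConfig(get: EnvGetter): EdgeConfig {
  const missing: string[] = [];
  const read = (name: string): string => {
    const value = get(name);
    if (value === undefined || value === "") {
      missing.push(name);
      return "";
    }
    return value;
  };
  const config: EdgeConfig = {
    openaiApiKey: read("OPENAI_API_KEY"),
    supabaseAnonKey: read("SUPABASE_ANON_KEY"),
  };
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}

// Placeholder values standing in for a real .env file.
const fakeEnv = new Map<string, string>([
  ["OPENAI_API_KEY", "sk-placeholder"],
  ["SUPABASE_ANON_KEY", "anon-placeholder"],
]);
const edgeConfig = loadConfig((name) => fakeEnv.get(name));
```

Collecting all missing names before throwing avoids the fix-one-rerun-fix-another loop during first deployment.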

Implementing Client-Side VAD

Replace the built-in server_vad with manual turn management by setting client.updateSession({ turn_detection: null }) and controlling the wavRecorder pause/resume logic based on custom audio level thresholds.
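One simple form of custom threshold logic is an energy detector with a short hangover. The sketch below is an illustration under stated assumptions, not the cookbook's method: rmsLevel and TurnDetector are hypothetical names, frames are assumed to be 16-bit PCM, and the threshold and hangover values are examples that would need tuning for a real microphone.

```typescript
// Root-mean-square level of a PCM16 frame, normalized to the range 0..1.
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  frame.forEach((sample) => {
    const normalized = sample / 32768;
    sumSquares += normalized * normalized;
  });
  return Math.sqrt(sumSquares / frame.length);
}

// Tracks speech state frame by frame and reports when a turn has ended.
class TurnDetector {
  private speaking = false;
  private silentFrames = 0;
  private readonly threshold: number;
  private readonly hangoverFrames: number;

  constructor(threshold: number, hangoverFrames: number) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Returns true once per turn, on the frame that completes the trailing silence.
  push(frame: Int16Array): boolean {
    if (rmsLevel(frame) >= this.threshold) {
      this.speaking = true;
      this.silentFrames = 0;
      return false;
    }
    if (!this.speaking) return false; // silence before any speech is ignored
    this.silentFrames++;
    if (this.silentFrames < this.hangoverFrames) return false;
    this.speaking = false;
    this.silentFrames = 0;
    return true;
  }
}

// Synthetic frames: constant amplitude stands in for speech, zeros for silence.
const loudFrame = new Int16Array(160).fill(8000);
const quietFrame = new Int16Array(160);
const detector = new TurnDetector(0.05, 3);
const turnEnds = [quietFrame, loudFrame, quietFrame, quietFrame, quietFrame, quietFrame]
  .map((frame) => detector.push(frame));
```

With turn_detection set to null, the frame on which push() returns true is the natural point to call client.createResponse().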

Summary

  • The OpenAI Cookbook provides a complete hardware-to-frontend reference for Realtime API voice agents
  • ESP-32 devices stream Opus-encoded audio through Deno edge functions to the Realtime API, with responses returned via WebSocket
  • The RealtimeClient handles session configuration, audio streaming via appendInputAudio(), and event processing for transcripts
  • Server-side VAD manages turn detection automatically; for manual control, disable turn_detection in the session config
  • Python evaluation tools in realtime_utils.py enable deterministic testing of streaming behavior and tool calls

Frequently Asked Questions

How does the Realtime API handle turn detection in voice agents?

The Realtime API supports server-side Voice Activity Detection (VAD) through the turn_detection session parameter. When configured with type: 'server_vad', the API automatically detects speech start and end points, triggering model responses without client-side silence detection. For custom behavior, set turn_detection: null and call client.createResponse() yourself once the speaker's turn has ended.

Can I run the voice agent examples without ESP-32 hardware?

Yes. While the cookbook includes ESP-32 firmware for IoT deployment, the React frontend and Deno edge functions can operate independently using browser-based microphone input. The SpeakerPage.tsx component demonstrates web-only audio capture using the wavRecorder API, streaming directly to the Realtime API without physical hardware.

What is the purpose of the ToolCallAccumulator class in the Python evaluation code?

The ToolCallAccumulator class in examples/evals/realtime_evals/shared/realtime_utils.py captures and aggregates function calls from Realtime API event streams during testing. It processes response.done events to extract function_call output items, enabling deterministic validation of agent behavior in CI pipelines and research environments.

How do I add support for additional languages in the translation example?

Add new language support by creating instruction prompts in translation_prompts.js, then registering the language in the languageConfigs array within SpeakerPage.tsx. Each entry requires a language code, client reference, and instruction string. The existing architecture automatically creates isolated RealtimeClient instances for each language, broadcasting audio input to all configured clients simultaneously.
