Symphony Retry Logic for Failed Agent Runs: Exponential Backoff Implementation

Symphony implements an exponential backoff retry mechanism with a configurable ceiling that detects worker crashes via Elixir process monitoring, calculates delays starting at 10 seconds and doubling with each attempt, and preserves full execution context to resume agent runs idempotently.

OpenAI's Symphony orchestrates AI agents that process Linear issues through Codex-backed workers. When an agent fails unexpectedly, the system treats the run as a resumable unit: its retry logic recovers from the crash automatically while maintaining the execution context.

Detecting Agent Failures in the Orchestrator

Symphony's orchestrator monitors every dispatched agent using Elixir's process monitoring. When a worker process exits abnormally, the orchestrator's handle_info/2 function in elixir/lib/symphony_elixir/orchestrator.ex receives a {:DOWN, ref, :process, _pid, reason} message.

The orchestrator locates the associated issue using find_issue_id_for_ref/2 and extracts the running entry via pop_running_entry/2. It then determines the next attempt count by calling next_retry_attempt_from_running/1:

def handle_info({:DOWN, ref, :process, _pid, reason}, %{running: running} = state) do
  case find_issue_id_for_ref(running, ref) do
    # Unknown monitor reference: nothing to retry.
    nil ->
      {:noreply, state}

    issue_id ->
      {running_entry, state} = pop_running_entry(state, issue_id)
      next_attempt = next_retry_attempt_from_running(running_entry)

      # Carry the execution context of the crashed run into the retry entry.
      state =
        schedule_issue_retry(state, issue_id, next_attempt, %{
          identifier: running_entry.identifier,
          error: "agent exited: #{inspect(reason)}",
          worker_host: Map.get(running_entry, :worker_host),
          workspace_path: Map.get(running_entry, :workspace_path)
        })

      # schedule_issue_retry/4 returns the updated state; wrap it for GenServer.
      {:noreply, state}
  end
end
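The monitoring pattern behind this clause can be tried in isolation. The following is a minimal sketch, not Symphony's code: the module MonitorSketch and the function run_and_wait/2 are hypothetical names, and the worker is a throwaway process that exits immediately. It only shows that a monitored process produces a {:DOWN, ref, :process, pid, reason} message carrying the exit reason.

```elixir
# Hypothetical sketch: MonitorSketch is not part of Symphony.
defmodule MonitorSketch do
  # Spawns a worker that exits with `reason`, monitors it, and waits for :DOWN.
  def run_and_wait(reason, timeout_ms \\ 1_000) do
    # spawn_monitor/1 spawns the process and sets up the monitor atomically.
    {pid, ref} = spawn_monitor(fn -> exit(reason) end)

    receive do
      # Pin ref and pid so only the message for this worker matches.
      {:DOWN, ^ref, :process, ^pid, down_reason} ->
        {:down, "agent exited: #{inspect(down_reason)}"}
    after
      timeout_ms -> :timeout
    end
  end
end

IO.inspect(MonitorSketch.run_and_wait(:boom))
```

In an orchestrator built on GenServer, the same message arrives in handle_info/2 instead of a bare receive, which is the shape of the clause shown above.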

Calculating Retry Delays

The delay calculation distinguishes between successful continuations and actual failures. This logic resides in retry_delay/2 and failure_retry_delay/1 within the orchestrator.

Continuation vs. Failure Backoff

Continuation retries occur when a task completed successfully but the orchestrator needs to poll for the next step. These use a fixed 1-second delay defined by @continuation_retry_delay_ms, but only on the first attempt; any later attempt falls through to the failure backoff.

Failure retries apply exponential backoff starting from @failure_retry_base_ms (10 seconds):

defp retry_delay(attempt, metadata) when is_integer(attempt) and attempt > 0 do
  if metadata[:delay_type] == :continuation and attempt == 1 do
    @continuation_retry_delay_ms
  else
    failure_retry_delay(attempt)
  end
end

defp failure_retry_delay(attempt) do
  max_delay_power = min(attempt - 1, 10)
  min(@failure_retry_base_ms * (1 <<< max_delay_power),
      Config.settings!().agent.max_retry_backoff_ms)
end

Exponential Backoff Formula

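The failure calculation uses a left bit shift (1 <<< max_delay_power, an operator from Elixir's Bitwise module) to compute 2^(attempt-1), with the exponent capped at 10. Elixir integers are arbitrary-precision and do not overflow, so the cap serves to keep the intermediate value from growing without bound. The result is multiplied by the 10-second base and clamped against the user-configured maximum from Config.settings!().agent.max_retry_backoff_ms.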
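To make the formula concrete, here is a small standalone sketch that reproduces the arithmetic. BackoffSketch and failure_delay/2 are hypothetical names, and the 300,000 ms ceiling is an assumed value chosen for illustration; the real ceiling comes from agent.max_retry_backoff_ms.

```elixir
# Hypothetical sketch of the delay arithmetic; not Symphony's module.
defmodule BackoffSketch do
  # Provides the <<< operator.
  import Bitwise

  # 10-second base and exponent cap, mirroring the values described above.
  @base_ms 10_000
  @max_power 10

  # `max_backoff_ms` stands in for the configured agent.max_retry_backoff_ms.
  def failure_delay(attempt, max_backoff_ms) when is_integer(attempt) and attempt > 0 do
    power = min(attempt - 1, @max_power)
    min(@base_ms * (1 <<< power), max_backoff_ms)
  end
end

# Assumed ceiling of 300_000 ms (5 minutes), for illustration only.
for attempt <- 1..6 do
  IO.puts("attempt #{attempt}: #{BackoffSketch.failure_delay(attempt, 300_000)} ms")
end
```

With that assumed ceiling, attempts 1 through 5 yield 10,000, 20,000, 40,000, 80,000, and 160,000 ms, and attempt 6 is clamped from 320,000 to 300,000 ms.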

Scheduling Retries with Context Preservation

The schedule_issue_retry/4 function records comprehensive metadata in state.retry_attempts before triggering the delay. This map stores the attempt number, timer reference, unique retry token, due timestamp, and execution context including the identifier, error string, worker host, and workspace path:

defp schedule_issue_retry(%State{} = state, issue_id, attempt, metadata) do
  delay_ms = retry_delay(attempt, metadata)

  # Generate the token once so the timer message and the stored entry agree;
  # otherwise the token check on {:retry_issue, ...} could never succeed.
  retry_token = make_ref()
  timer_ref = Process.send_after(self(), {:retry_issue, issue_id, retry_token}, delay_ms)

  %{state |
    retry_attempts: Map.put(state.retry_attempts, issue_id, %{
      attempt: attempt,
      timer_ref: timer_ref,
      retry_token: retry_token,
      due_at_ms: System.monotonic_time(:millisecond) + delay_ms,
      identifier: pick_retry_identifier(issue_id, %{}, metadata),
      error: pick_retry_error(%{}, metadata),
      worker_host: pick_retry_worker_host(%{}, metadata),
      workspace_path: pick_retry_workspace_path(%{}, metadata)
    })}
end
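The timer mechanics can be exercised without an orchestrator. This is a hedged sketch under simplifying assumptions: ScheduleSketch and schedule/3 are hypothetical names, the entry holds only the timing fields, and the delay is 20 ms so the example finishes quickly. It illustrates one token being shared by the timer message and the stored entry.

```elixir
# Hypothetical sketch of token-tagged retry scheduling; not Symphony's code.
defmodule ScheduleSketch do
  # Schedules a retry message to the calling process and returns the bookkeeping entry.
  def schedule(issue_id, attempt, delay_ms) do
    # One token, used both in the message and in the entry.
    token = make_ref()
    timer_ref = Process.send_after(self(), {:retry_issue, issue_id, token}, delay_ms)

    %{
      attempt: attempt,
      timer_ref: timer_ref,
      retry_token: token,
      # Monotonic time is unaffected by wall-clock adjustments.
      due_at_ms: System.monotonic_time(:millisecond) + delay_ms
    }
  end
end

entry = ScheduleSketch.schedule("issue-1", 1, 20)

receive do
  {:retry_issue, "issue-1", token} ->
    IO.inspect(token == entry.retry_token)
after
  500 -> IO.puts("timer did not fire")
end
```

Because the message token and the stored token are the same reference, the comparison prints true once the timer fires.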

Executing Retried Agent Runs

When the timer fires, the orchestrator receives a {:retry_issue, issue_id, retry_token} message. The handle_info/2 clause validates the token via pop_retry_attempt_state/3 so that stale or duplicate timer messages are ignored, then either re-dispatches the issue or schedules another retry if no worker slots are available:

def handle_info({:retry_issue, issue_id, retry_token}, state) do
  case pop_retry_attempt_state(state, issue_id, retry_token) do
    {:ok, attempt, metadata, state} ->
      handle_retry_issue(state, issue_id, attempt, metadata)
    :missing -> {:noreply, state}
  end
end
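The source does not show the body of pop_retry_attempt_state/3, so the following is only a guess at the kind of check involved, written as a pure function over a plain map. RetryTokenSketch and pop/3 are hypothetical names, and the return shape is simplified compared with the four-element tuple used above.

```elixir
# Hypothetical sketch of a stale-token guard; the real function may differ.
defmodule RetryTokenSketch do
  # Returns {:ok, entry, remaining} when the token matches the stored entry,
  # or :missing when the entry is absent or the token is stale.
  def pop(retry_attempts, issue_id, token) do
    case Map.get(retry_attempts, issue_id) do
      %{retry_token: ^token} = entry ->
        {:ok, entry, Map.delete(retry_attempts, issue_id)}

      _ ->
        :missing
    end
  end
end

current = make_ref()
stale = make_ref()
attempts = %{"issue-1" => %{attempt: 2, retry_token: current}}

# A message carrying an outdated token is ignored.
IO.inspect(RetryTokenSketch.pop(attempts, "issue-1", stale))

# The current token consumes the entry, so a duplicate delivery would find nothing.
{:ok, entry, remaining} = RetryTokenSketch.pop(attempts, "issue-1", current)
IO.inspect({entry.attempt, remaining})
```

Removing the entry on a successful match is what makes a second delivery of the same message harmless: the lookup then returns :missing.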

Configuration Limits

The retry mechanism respects bounds defined in elixir/config/config.ex. The agent.max_retry_backoff_ms setting hard-caps the exponential delay, while the internal implementation limits the exponent calculation to 10 steps (max_delay_power = min(attempt - 1, 10)), ensuring the delay never exceeds approximately 2.8 hours (10 seconds × 2^10) before clamping.

Summary

  • Failure detection occurs through Elixir process monitoring (:DOWN messages) in handle_info/2.
  • Delay calculation uses exponential backoff starting at 10 seconds, doubling with each attempt, but capped at a configurable maximum via Config.settings!().agent.max_retry_backoff_ms.
  • Context preservation maintains identifier, error details, worker host, and workspace path across retries via state.retry_attempts.
  • Timing utilizes Process.send_after/3 with unique retry tokens to ensure idempotent execution.
  • Continuation vs. failure logic distinguishes between successful polling (1-second delay) and actual crashes (exponential backoff).

Frequently Asked Questions

What triggers Symphony's retry mechanism?

When a Codex-backed worker process crashes or exits unexpectedly while processing a Linear issue, the orchestrator receives a {:DOWN, ...} message. This triggers the retry logic, which calculates the next backoff delay and schedules a retry attempt, preserving the original execution context including workspace paths and error details.

How does Symphony calculate exponential backoff delays?

Symphony calculates delays using failure_retry_delay/1, which applies the formula min(10s * 2^(attempt-1), max_configured_backoff). The exponent is capped at 10 to keep the intermediate value bounded (Elixir integers are arbitrary-precision, so this is not an overflow guard), and the final delay cannot exceed the agent.max_retry_backoff_ms configuration value. First-attempt continuations receive a fixed 1-second delay instead.

What metadata is preserved during a retry?

The orchestrator preserves the identifier (for correlation), the error string (for debugging), the worker_host (for host affinity), and the workspace_path (for filesystem context). This metadata is stored in the state.retry_attempts map under the issue ID, so a resumed agent can pick up with the same host and workspace context as the failed run.

Is there a maximum retry delay in Symphony?

Yes. While the exponential calculation uses powers of 2, Symphony caps the exponent at 10 (preventing delays beyond ~2.8 hours from the base) and applies a hard ceiling via Config.settings!().agent.max_retry_backoff_ms. This jitter-free, bounded backoff keeps the wait between retries from growing indefinitely.
