# Symphony Retry Logic for Failed Agent Runs: Exponential Backoff Implementation

> Symphony employs exponential backoff for failed agent runs, using Elixir process monitoring and preserving context for idempotent retries. Learn about its robust retry logic.

- Repository: [OpenAI/symphony](https://github.com/openai/symphony)
- Tags: how-to-guide
- Published: 2026-05-08

---

**Symphony implements an exponential backoff retry mechanism with a configurable ceiling that detects worker crashes via Elixir process monitoring, calculates delays starting at 10 seconds and doubling with each attempt, and preserves full execution context to resume agent runs idempotently.**

OpenAI's Symphony orchestrates AI agents that process Linear issues through Codex-backed workers. When an agent fails unexpectedly, the system treats the **agent run** as a resumable unit: its retry logic recovers from the crash automatically while keeping the run's execution context.

## Detecting Agent Failures in the Orchestrator

Symphony's orchestrator monitors every dispatched agent using Elixir's process monitoring. When a worker process exits abnormally, the orchestrator's `handle_info/2` function in [`elixir/lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/orchestrator.ex) receives a `{:DOWN, ref, :process, _pid, reason}` message.

The orchestrator locates the associated issue using `find_issue_id_for_ref/2` and extracts the running entry via `pop_running_entry/2`. It then determines the next attempt count by calling `next_retry_attempt_from_running/1`:

```elixir
def handle_info({:DOWN, ref, :process, _pid, reason}, %{running: running} = state) do
  case find_issue_id_for_ref(running, ref) do
    # The monitor reference does not belong to a tracked run: nothing to retry.
    nil ->
      {:noreply, state}

    issue_id ->
      {running_entry, state} = pop_running_entry(state, issue_id)
      next_attempt = next_retry_attempt_from_running(running_entry)

      # schedule_issue_retry/4 returns the updated state, so wrap it in the
      # {:noreply, state} tuple that handle_info/2 must return.
      state =
        schedule_issue_retry(state, issue_id, next_attempt, %{
          identifier: running_entry.identifier,
          error: "agent exited: #{inspect(reason)}",
          worker_host: Map.get(running_entry, :worker_host),
          workspace_path: Map.get(running_entry, :workspace_path)
        })

      {:noreply, state}
  end
end
```
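The `:DOWN` message that drives this handler can be observed outside the orchestrator. The following is a minimal sketch, not Symphony code: `MonitorSketch` is a hypothetical module that spawns a monitored process, lets it exit abnormally, and waits for the resulting message.

```elixir
defmodule MonitorSketch do
  # Spawn a monitored process that exits abnormally, then wait for the
  # {:DOWN, ref, :process, pid, reason} message the monitor delivers.
  def run do
    {pid, ref} = spawn_monitor(fn -> exit(:boom) end)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> {:down, reason}
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(MonitorSketch.run())
```

The `reason` carried by the message is the exit reason of the monitored process (`:boom` here), which is the value the orchestrator formats into its `error` string.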

## Calculating Retry Delays

The delay calculation distinguishes between successful continuations and actual failures. This logic resides in `retry_delay/2` and `failure_retry_delay/1` within the orchestrator.

### Continuation vs. Failure Backoff

**Continuation retries** occur when a run completed successfully but the orchestrator needs to poll for the next step. The first such attempt uses a fixed **1-second delay** defined by `@continuation_retry_delay_ms`; any later attempt falls through to the failure backoff.

**Failure retries** apply exponential backoff starting from `@failure_retry_base_ms` (10 seconds):

```elixir
defp retry_delay(attempt, metadata) when is_integer(attempt) and attempt > 0 do
  if metadata[:delay_type] == :continuation and attempt == 1 do
    @continuation_retry_delay_ms
  else
    failure_retry_delay(attempt)
  end
end

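# `<<<` is the left-shift operator; the module must `import Bitwise` to use it.
defp failure_retry_delay(attempt) do
  # Cap the exponent so the uncapped delay stops growing at base * 2^10.
  max_delay_power = min(attempt - 1, 10)

  # Clamp the exponential delay to the configured ceiling.
  min(@failure_retry_base_ms * (1 <<< max_delay_power),
      Config.settings!().agent.max_retry_backoff_ms)
end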
```

### Exponential Backoff Formula

The failure calculation uses a left shift (`1 <<< max_delay_power`) to compute `2^(attempt-1)` and caps the exponent at **10**. Because Elixir integers are arbitrary-precision, the cap does not guard against overflow; it stops the uncapped delay from growing past the eleventh attempt. The result is multiplied by the 10-second base and clamped against the user-configured maximum from `Config.settings!().agent.max_retry_backoff_ms`.
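To make the arithmetic concrete, here is a standalone sketch of the same calculation. `BackoffSketch` is a hypothetical module, the ceiling is passed as an argument instead of being read from `Config.settings!()`, and the 300,000 ms ceiling is an assumed value chosen for illustration, not Symphony's default.

```elixir
defmodule BackoffSketch do
  import Bitwise

  @continuation_ms 1_000
  @base_ms 10_000

  # Same branching as retry_delay/2 in the article, with an explicit ceiling.
  def retry_delay(attempt, metadata, max_backoff_ms) when is_integer(attempt) and attempt > 0 do
    if metadata[:delay_type] == :continuation and attempt == 1 do
      @continuation_ms
    else
      failure_delay(attempt, max_backoff_ms)
    end
  end

  # base * 2^(attempt - 1), exponent capped at 10, result clamped to the ceiling.
  def failure_delay(attempt, max_backoff_ms) do
    power = min(attempt - 1, 10)
    min(@base_ms * (1 <<< power), max_backoff_ms)
  end
end

# 300_000 ms (5 minutes) is a hypothetical ceiling used only for this example.
for attempt <- 1..7 do
  IO.puts("attempt #{attempt}: #{BackoffSketch.failure_delay(attempt, 300_000)} ms")
end
```

With that assumed ceiling, attempts 1 through 5 yield 10, 20, 40, 80, and 160 seconds, and every attempt from 6 onward is clamped to 300 seconds. A second continuation attempt is treated like a failure and waits 20 seconds rather than 1.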

## Scheduling Retries with Context Preservation

The `schedule_issue_retry/4` function records comprehensive metadata in `state.retry_attempts` before triggering the delay. This map stores the attempt number, timer reference, unique retry token, due timestamp, and execution context including the identifier, error string, worker host, and workspace path:

```elixir
defp schedule_issue_retry(%State{} = state, issue_id, attempt, metadata) do
  delay_ms = retry_delay(attempt, metadata)

  # Create the token once so the timer message and the stored entry carry
  # the same reference; two separate make_ref() calls would never match.
  retry_token = make_ref()
  timer_ref = Process.send_after(self(), {:retry_issue, issue_id, retry_token}, delay_ms)

  %{state |
    retry_attempts: Map.put(state.retry_attempts, issue_id, %{
      attempt: attempt,
      timer_ref: timer_ref,
      retry_token: retry_token,
      due_at_ms: System.monotonic_time(:millisecond) + delay_ms,
      identifier: pick_retry_identifier(issue_id, %{}, metadata),
      error: pick_retry_error(%{}, metadata),
      worker_host: pick_retry_worker_host(%{}, metadata),
      workspace_path: pick_retry_workspace_path(%{}, metadata)
    })}
end
```
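The timer-plus-token pattern can be tried in isolation. The sketch below is an illustration under assumptions, not the orchestrator's code: `RetryTimerSketch`, `schedule/3`, and `await/3` are hypothetical names, and the entry holds only the timing fields.

```elixir
defmodule RetryTimerSketch do
  # Arm a timer that sends {:retry_issue, issue_id, token} to the caller and
  # return the bookkeeping entry that would be stored for that issue.
  def schedule(issue_id, attempt, delay_ms) do
    retry_token = make_ref()
    timer_ref = Process.send_after(self(), {:retry_issue, issue_id, retry_token}, delay_ms)

    %{
      attempt: attempt,
      timer_ref: timer_ref,
      retry_token: retry_token,
      due_at_ms: System.monotonic_time(:millisecond) + delay_ms
    }
  end

  # Wait for the timer message that carries this entry's token.
  def await(%{retry_token: retry_token}, issue_id, timeout_ms) do
    receive do
      {:retry_issue, ^issue_id, ^retry_token} -> :fired
    after
      timeout_ms -> :timeout
    end
  end
end

entry = RetryTimerSketch.schedule("ISSUE-1", 1, 50)
IO.inspect(RetryTimerSketch.await(entry, "ISSUE-1", 1_000))
```

Keeping `timer_ref` in the entry also makes it possible to cancel a pending retry with `Process.cancel_timer/1`, for example when a newer retry for the same issue replaces it.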

## Executing Retried Agent Runs

When the timer fires, the orchestrator receives a `{:retry_issue, issue_id, retry_token}` message. The `handle_info/2` clause passes the token to `pop_retry_attempt_state/3`, which returns `:missing` for a stale or unknown token so that duplicate messages are ignored. On a match, the orchestrator either re-dispatches the issue or schedules another retry if no worker slots are available:

```elixir
def handle_info({:retry_issue, issue_id, retry_token}, state) do
  case pop_retry_attempt_state(state, issue_id, retry_token) do
    # Token matches the stored entry: proceed with the retry.
    {:ok, attempt, metadata, state} ->
      handle_retry_issue(state, issue_id, attempt, metadata)

    # Stale or unknown token: ignore the message.
    :missing ->
      {:noreply, state}
  end
end
```
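The body of `pop_retry_attempt_state/3` is not shown here, so the following is only a sketch of how a token check of this kind could work. `RetryTokenSketch.pop/3` is a hypothetical function that operates on a plain map instead of the orchestrator state.

```elixir
defmodule RetryTokenSketch do
  # Remove and return the entry only when the token in the timer message
  # matches the token stored for that issue; otherwise report :missing.
  def pop(retry_attempts, issue_id, retry_token) do
    case Map.get(retry_attempts, issue_id) do
      %{retry_token: ^retry_token} = entry ->
        {:ok, entry, Map.delete(retry_attempts, issue_id)}

      _ ->
        :missing
    end
  end
end

current = make_ref()
stale = make_ref()
attempts = %{"ISSUE-1" => %{attempt: 2, retry_token: current}}

# A message from an older, superseded timer is rejected.
IO.inspect(RetryTokenSketch.pop(attempts, "ISSUE-1", stale))
```

Because each scheduled retry gets a fresh reference, a message from a superseded timer cannot match the stored entry, and the entry is consumed at most once.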

## Configuration Limits

The retry mechanism respects bounds defined in [`elixir/config/config.ex`](https://github.com/openai/symphony/blob/main/elixir/config/config.ex). The `agent.max_retry_backoff_ms` setting hard-caps the exponential delay. Independently, the implementation limits the exponent to 10 (`max_delay_power = min(attempt - 1, 10)`), so the uncapped delay tops out at 10,240 seconds (10 seconds × 2^10, roughly 2.8 hours) even before the configured ceiling is applied.

## Summary

- **Failure detection** occurs through Elixir process monitoring (`:DOWN` messages) in `handle_info/2`.
- **Delay calculation** uses exponential backoff starting at 10 seconds, doubling with each attempt, but capped at a configurable maximum via `Config.settings!().agent.max_retry_backoff_ms`.
- **Context preservation** maintains identifier, error details, worker host, and workspace path across retries via `state.retry_attempts`.
- **Timing** uses `Process.send_after/3` with a unique retry token per scheduled retry, so stale or duplicate timer messages are ignored.
- **Continuation vs. failure** logic distinguishes between successful polling (1-second delay) and actual crashes (exponential backoff).

## Frequently Asked Questions

### What triggers Symphony's retry mechanism?

When a Codex-backed worker process crashes or exits unexpectedly while processing a Linear issue, the orchestrator receives a `{:DOWN, ...}` message. This triggers the retry logic, which calculates the next backoff delay and schedules a retry attempt, preserving the original execution context including workspace paths and error details.

### How does Symphony calculate exponential backoff delays?

Symphony calculates delays using `failure_retry_delay/1`, which applies the formula `min(10s * 2^(attempt-1), max_configured_backoff)`. The exponent is capped at 10 so the uncapped delay stops growing after the eleventh attempt, and the final delay cannot exceed the `agent.max_retry_backoff_ms` configuration value. First-attempt continuations receive a fixed 1-second delay instead.

### What metadata is preserved during a retry?

The orchestrator preserves the `identifier` (for correlation), the `error` string (for debugging), the `worker_host` (for host affinity), and the `workspace_path` (for filesystem context). This metadata is stored in the `state.retry_attempts` map under the issue ID, so a retried run can be dispatched with the same host and workspace context as the failed one.

### Is there a maximum retry delay in Symphony?

Yes. While the exponential calculation uses powers of 2, Symphony caps the exponent at 10 (so the uncapped delay never exceeds roughly 2.8 hours) and applies a hard ceiling via `Config.settings!().agent.max_retry_backoff_ms`. This jitter-free backoff keeps the wait between retries bounded and predictable.