# How Symphony Restarts Stalled Issues: Automatic Recovery in OpenAI's Orchestrator

> Learn how Symphony restarts stalled issues automatically. Discover OpenAI's orchestrator self-healing capabilities and exponential back-off strategy for robust workflow execution.

- Repository: [OpenAI/symphony](https://github.com/openai/symphony)
- Tags: how-to-guide
- Published: 2026-05-08

---

**Symphony’s orchestration layer continuously monitors running issues and automatically restarts any that exceed a configurable stall timeout, applying exponential back-off to ensure self-healing workflow execution.**

The openai/symphony repository implements a robust orchestration engine in Elixir designed to manage long-running computational issues. When an issue stops reporting activity, Symphony treats it as stalled and triggers an automatic restart sequence to maintain pipeline health without manual intervention.

## How Stalled Issue Detection Works

Stall detection operates on every orchestration cycle through the main reconciliation loop defined in **[`elixir/lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/orchestrator.ex)**.

### The Reconciliation Entry Point

The function **`reconcile_stalled_running_issues/1`** is the entry point for stall detection. On each orchestrator tick, it iterates over `state.running`, reads the timeout threshold from **`Config.settings!().codex.stall_timeout_ms`** (defined in the config schema at lines 175-178), and compares that threshold against each issue's elapsed idle time. Issues that exceed the threshold are passed to **`restart_stalled_issue/5`**.
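As a rough illustration of the scan described above, the following minimal sketch filters a map of running entries by idle time. It is not Symphony's code: the module name `StallScanSketch`, the function `stalled_ids/3`, and the `:last_activity_at` field are hypothetical names chosen for this example, and the real function works on the orchestrator state rather than a bare map.

```elixir
defmodule StallScanSketch do
  @moduledoc "Hypothetical illustration of a stall scan; not Symphony's implementation."

  # `running` maps issue ids to entries that carry a `:last_activity_at`
  # DateTime (an assumed field name). Returns the sorted ids whose idle
  # time is strictly greater than `timeout_ms`.
  def stalled_ids(running, timeout_ms, now \\ DateTime.utc_now()) do
    running
    |> Enum.filter(fn {_id, entry} ->
      DateTime.diff(now, entry.last_activity_at, :millisecond) > timeout_ms
    end)
    |> Enum.map(fn {id, _entry} -> id end)
    |> Enum.sort()
  end
end
```

Passing `now` explicitly keeps the check easy to exercise with fixed timestamps instead of the system clock.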

### Configuration: Setting the Stall Timeout

The stall timeout defaults to 5 minutes (300,000 ms) but is fully configurable through the Codex configuration schema. The orchestrator reads this value dynamically at runtime, allowing operators to tune sensitivity based on workload characteristics without redeploying code.

## Calculating Idle Time for Running Issues

Before restarting, Symphony computes precise idle duration using **`stall_elapsed_ms/2`** (lines 89-99).

This helper function examines the **`last_codex_timestamp`** field (or falls back to the issue start time) and calculates the difference in milliseconds using `DateTime.diff/3`. If no timestamp exists, the function returns `nil`, preventing false positives from triggering unnecessary restarts. This timestamp data originates from the tracker module, which records execution metrics for active issues.
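The fallback and `nil` behavior can be sketched as follows. This is an assumption-laden illustration rather than the actual helper: `IdleTimeSketch`, `elapsed_ms/2`, and the `:started_at` field name are hypothetical, and only `last_codex_timestamp` and `DateTime.diff/3` come from the description above.

```elixir
defmodule IdleTimeSketch do
  @moduledoc "Hypothetical stand-in for the idle-time helper; not Symphony's implementation."

  # Prefers the last activity timestamp, falls back to the start time
  # (`:started_at` is an assumed field name), and returns nil when neither
  # is a DateTime so the caller can skip the restart.
  def elapsed_ms(entry, now) do
    case Map.get(entry, :last_codex_timestamp) || Map.get(entry, :started_at) do
      %DateTime{} = timestamp -> DateTime.diff(now, timestamp, :millisecond)
      _missing -> nil
    end
  end
end
```

Returning `nil` instead of `0` lets the caller distinguish "no data" from "just became active", which is what keeps missing timestamps from being compared against the timeout.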

## The Restart Flow: From Detection to Recovery

Once an issue is confirmed stalled, **`restart_stalled_issue/5`** (lines 67-84) executes a four-step recovery protocol:

1. **Log the stall event** – A warning is emitted containing the issue identifier and elapsed stall duration.
2. **Determine retry attempt** – **`next_retry_attempt_from_running/1`** calculates the appropriate back-off level for the restart.
3. **Terminate the stuck task** – **`terminate_running_issue/3`** forcefully stops the original task and triggers necessary workspace cleanup.
4. **Re-queue with back-off** – **`schedule_issue_retry/4`** creates a new execution attempt with an error payload describing the stall reason and applies exponential back-off before the next dispatch.

This sequence ensures that stalled issues are cleanly isolated and re-inserted into the pipeline with appropriate delay mechanisms.
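The four steps can be sketched in a simplified form. Everything named here is hypothetical: the `RestartSketch` module, the `:retry_attempt` field, the `terminate_fun` callback, and in particular the base delay and cap, since this guide does not state the values Symphony uses. The doubling formula is one common way to implement exponential back-off and is assumed, not confirmed.

```elixir
defmodule RestartSketch do
  @moduledoc "Hypothetical illustration of the restart flow; not Symphony's implementation."
  require Logger

  # Assumed back-off parameters, chosen only for this example.
  @base_delay_ms 10_000
  @max_delay_ms 300_000

  # The delay doubles with each 1-based attempt and is capped at @max_delay_ms.
  def backoff_ms(attempt) when is_integer(attempt) and attempt >= 1 do
    min(@base_delay_ms * Integer.pow(2, attempt - 1), @max_delay_ms)
  end

  # Mirrors the four steps: log, pick the attempt, terminate, build the retry.
  # `terminate_fun` is a caller-supplied callback standing in for task termination.
  def restart(issue_id, entry, elapsed_ms, terminate_fun) do
    Logger.warning("Issue #{issue_id} stalled for #{elapsed_ms}ms; restarting")
    attempt = Map.get(entry, :retry_attempt, 0) + 1
    :ok = terminate_fun.(issue_id)

    %{
      issue_id: issue_id,
      attempt: attempt,
      delay_ms: backoff_ms(attempt),
      error: "stalled for #{elapsed_ms}ms without activity"
    }
  end
end
```

With these assumed parameters, the first retry waits 10 seconds, the third waits 40 seconds, and later attempts level off at the 5-minute cap, so a repeatedly stalling issue cannot monopolize dispatch.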

## Configuring the Stall Timeout in Practice

You can adjust the stall detection sensitivity at runtime by updating the configuration struct:

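```elixir
# Reduce the stall timeout from the default 5 minutes to 2 minutes.
config = SymphonyElixir.Config.settings!()

# Change only the nested Codex setting; every other field is kept as-is.
updated_codex = Map.put(config.codex, :stall_timeout_ms, 2 * 60_000)

# Write the updated settings back so the next orchestration cycle reads them.
SymphonyElixir.Config.put_settings!(%{config | codex: updated_codex})
```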

In production deployments, the detection and restart functions described above run automatically within the orchestrator's main loop; invoking them manually is typically unnecessary unless you are building custom monitoring tooling.

## Key Source Files for Issue Restart Logic

Understanding the restart mechanism requires familiarity with these core modules:

- **[`elixir/lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/orchestrator.ex)** – Contains `reconcile_stalled_running_issues/1`, `stall_elapsed_ms/2`, and `restart_stalled_issue/5`, implementing the complete detection and recovery flow.
- **[`elixir/lib/symphony_elixir/config/schema.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/config/schema.ex)** – Defines the `stall_timeout_ms` field (lines 175-178) used to configure detection sensitivity.
- **[`elixir/lib/symphony_elixir/config.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/config.ex)** – Provides runtime access to configuration settings consumed by the orchestrator.
- **[`elixir/lib/symphony_elixir/tracker.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/tracker.ex)** – Maintains execution timestamps including `last_codex_timestamp` used for idle time calculations.
- **[`elixir/lib/symphony_elixir/workspace.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/workspace.ex)** – Handles resource cleanup operations invoked during task termination.

## Summary

Symphony implements automatic recovery for stalled issues through a deterministic three-phase process:

- **Detection** occurs via `reconcile_stalled_running_issues/1`, which compares elapsed idle time against `Config.settings!().codex.stall_timeout_ms` every orchestration cycle.
- **Calculation** uses `stall_elapsed_ms/2` to safely compute idle duration from `last_codex_timestamp`, preventing false restarts when data is missing.
- **Recovery** executes through `restart_stalled_issue/5`, which terminates the stuck task and re-queues it with exponential back-off and descriptive error context.

This architecture ensures that temporary hangs or silent failures do not block the orchestration pipeline indefinitely.

## Frequently Asked Questions

### What triggers a stalled issue restart in Symphony?

An issue restarts when the time elapsed since its last activity exceeds the `stall_timeout_ms` configuration value. The orchestrator evaluates this condition on every orchestration cycle through `reconcile_stalled_running_issues/1`, checking the `last_codex_timestamp` of each running issue (or its start time if no activity has been recorded) against the current time.

### How does Symphony calculate if an issue is stalled?

Symphony uses the `stall_elapsed_ms/2` function to compute idle time. It takes the issue state and calculates the millisecond difference between the current time and the `last_codex_timestamp` (or the start time if no activity has been recorded). If neither timestamp exists, it returns `nil` and the issue is not restarted. If the computed duration exceeds the configured timeout, the issue qualifies for restart.

### Can I configure the stall timeout duration?

Yes. The timeout is configurable through `Config.settings!().codex.stall_timeout_ms`, defined in the schema at [`elixir/lib/symphony_elixir/config/schema.ex`](https://github.com/openai/symphony/blob/main/elixir/lib/symphony_elixir/config/schema.ex). You can modify this value at runtime using `SymphonyElixir.Config.put_settings!/1`, allowing you to tune detection sensitivity for different workload types without code changes.

### What happens to the original task when Symphony restarts a stalled issue?

The original task is forcefully terminated via `terminate_running_issue/3`, which handles cleanup of associated resources and workspace state. The orchestrator then creates a new execution attempt through `schedule_issue_retry/4`, applying exponential back-off to prevent immediate retry loops and preserving error context describing the stall condition.