How Symphony Restarts Stalled Issues: Automatic Recovery in OpenAI's Orchestrator

Symphony’s orchestration layer continuously monitors running issues and automatically restarts any that exceed a configurable stall timeout, applying exponential back-off to ensure self-healing workflow execution.

The openai/symphony repository implements a robust orchestration engine in Elixir designed to manage long-running computational issues. When an issue stops reporting activity, Symphony treats it as stalled and triggers an automatic restart sequence to maintain pipeline health without manual intervention.

How Stalled Issue Detection Works

Stall detection operates on every orchestration cycle through the main reconciliation loop defined in elixir/lib/symphony_elixir/orchestrator.ex.

The Reconciliation Entry Point

The function reconcile_stalled_running_issues/1 serves as the primary entry point for stall detection. On each orchestrator tick, it iterates over state.running and evaluates each running issue's idle duration. It reads the timeout threshold from Config.settings!().codex.stall_timeout_ms (defined in the config schema at lines 175-178) and compares it against the elapsed idle time. Issues that exceed the threshold are passed to restart_stalled_issue/5 (lines 48-66).
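The scan described above can be sketched roughly as follows. This is a minimal illustration, not Symphony's actual code: the module name, the function name, and the shape of the running map (issue ID mapped to an entry holding a :last_codex_timestamp) are assumptions made for the example.

```elixir
defmodule StallScanSketch do
  @moduledoc "Illustrative only; not the Symphony implementation."

  # `running` is assumed to map issue IDs to entries that may carry a
  # `:last_codex_timestamp` DateTime (hypothetical shape).
  # Returns the IDs whose idle time exceeds `timeout_ms`.
  def stalled_issue_ids(running, timeout_ms, now \\ DateTime.utc_now()) do
    for {issue_id, entry} <- running,
        # Entries without a timestamp are skipped (nil acts as a filter).
        ts = entry[:last_codex_timestamp],
        DateTime.diff(now, ts, :millisecond) > timeout_ms,
        do: issue_id
  end
end
```

Passing the current time in as an argument keeps the check easy to test; an orchestrator would typically use the default clock value.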

Configuration: Setting the Stall Timeout

The stall timeout defaults to 5 minutes (300,000 ms) but is fully configurable through the Codex configuration schema. The orchestrator reads this value dynamically at runtime, allowing operators to tune sensitivity based on workload characteristics without redeploying code.

Calculating Idle Time for Running Issues

Before restarting, Symphony computes precise idle duration using stall_elapsed_ms/2 (lines 89-99).

This helper function examines the last_codex_timestamp field (or falls back to the issue start time) and calculates the difference in milliseconds using DateTime.diff/3. If no timestamp exists, the function returns nil, preventing false positives from triggering unnecessary restarts. This timestamp data originates from the tracker module, which records execution metrics for active issues.
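A helper with this behavior might look like the sketch below. The module name, the argument order, and the :started_at fallback key are hypothetical; only the use of DateTime.diff/3 and the nil return for missing timestamps come from the description above.

```elixir
defmodule StallElapsedSketch do
  @moduledoc "Illustrative only; field names are assumptions."

  # Returns idle time in milliseconds, or nil when no usable timestamp exists.
  def elapsed_ms(entry, now) do
    # Prefer the last activity timestamp, then fall back to the start time.
    case Map.get(entry, :last_codex_timestamp) || Map.get(entry, :started_at) do
      %DateTime{} = ts -> DateTime.diff(now, ts, :millisecond)
      _ -> nil
    end
  end
end
```

Returning nil rather than 0 lets the caller distinguish "no data" from "just active", so an issue with no recorded timestamp is never treated as stalled.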

The Restart Flow: From Detection to Recovery

Once an issue is confirmed stalled, restart_stalled_issue/5 (lines 67-84) executes a four-step recovery protocol:

  1. Log the stall event – A warning is emitted containing the issue identifier and elapsed stall duration.
  2. Determine retry attempt – next_retry_attempt_from_running/1 calculates the appropriate back-off level for the restart.
  3. Terminate the stuck task – terminate_running_issue/3 forcefully stops the original task and triggers necessary workspace cleanup.
  4. Re-queue with back-off – schedule_issue_retry/4 creates a new execution attempt with an error payload describing the stall reason and applies exponential back-off before the next dispatch.

This sequence ensures that stalled issues are cleanly isolated and re-inserted into the pipeline with appropriate delay mechanisms.
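The four steps can be modeled in a simplified, side-effect-light form as shown here. Everything in this sketch is a stand-in: the state shape (:running and :retries maps), the :retry_attempt field, and the function name are assumptions, and "terminating" the task is reduced to removing it from the running map.

```elixir
defmodule RestartFlowSketch do
  @moduledoc "Illustrative only; state shape and names are assumptions."
  require Logger

  def restart_stalled(state, issue_id, elapsed_ms) do
    entry = Map.fetch!(state.running, issue_id)

    # 1. Log the stall event with the issue identifier and stall duration.
    Logger.warning("Issue #{issue_id} stalled for #{elapsed_ms}ms; restarting")

    # 2. Determine the next retry attempt from the running entry.
    attempt = Map.get(entry, :retry_attempt, 0) + 1

    # 3. Terminate the stuck task (modeled here as dropping the entry).
    running = Map.delete(state.running, issue_id)

    # 4. Re-queue with an error payload that describes the stall.
    retry = %{attempt: attempt, error: "stalled for #{elapsed_ms}ms"}
    %{state | running: running, retries: Map.put(state.retries, issue_id, retry)}
  end
end
```

A real orchestrator would also stop the underlying process and clean up the workspace in step 3, and would schedule a delayed dispatch in step 4 rather than only recording the retry.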

Configuring the Stall Timeout in Practice

You can adjust the stall detection sensitivity at runtime by updating the configuration struct:


# Reduce the stall timeout from the default 5 minutes to 2 minutes
config = SymphonyElixir.Config.settings!()
updated_codex = Map.put(config.codex, :stall_timeout_ms, 2 * 60_000)

# Write the settings back with the updated codex section
SymphonyElixir.Config.put_settings!(%{config | codex: updated_codex})

In production deployments, stall detection and recovery run automatically within the orchestrator's main loop; manual invocation is typically unnecessary unless you are implementing custom monitoring tooling.

Key Source Files for Issue Restart Logic

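Understanding the restart mechanism requires familiarity with these core modules:

  • elixir/lib/symphony_elixir/orchestrator.ex – the main reconciliation loop, where stall detection runs on every orchestration cycle.
  • elixir/lib/symphony_elixir/config/schema.ex – the config schema that defines codex.stall_timeout_ms.
  • The tracker module – records execution metrics for active issues, including the timestamps used to measure idle time.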

Summary

Symphony implements automatic recovery for stalled issues through a deterministic three-phase process:

  • Detection occurs via reconcile_stalled_running_issues/1, which compares elapsed idle time against Config.settings!().codex.stall_timeout_ms every orchestration cycle.
  • Calculation uses stall_elapsed_ms/2 to safely compute idle duration from last_codex_timestamp, preventing false restarts when data is missing.
  • Recovery executes through restart_stalled_issue/5, which terminates the stuck task and re-queues it with exponential back-off and descriptive error context.

This architecture ensures that temporary hangs or silent failures do not block the orchestration pipeline indefinitely.

Frequently Asked Questions

What triggers a stalled issue restart in Symphony?

An issue restarts when the time elapsed since its last activity exceeds the stall_timeout_ms configuration value. The orchestrator evaluates this condition continuously through reconcile_stalled_running_issues/1, checking the last_codex_timestamp of every running issue against the current time.

How does Symphony calculate if an issue is stalled?

Symphony uses the stall_elapsed_ms/2 function to compute idle time. It takes the issue state and calculates the millisecond difference between the current time and the last_codex_timestamp (or the start time if no activity has been recorded). If neither timestamp exists, it returns nil and the issue is not treated as stalled. If the computed duration exceeds the configured timeout, the issue qualifies for restart.

Can I configure the stall timeout duration?

Yes. The timeout is configurable through Config.settings!().codex.stall_timeout_ms, defined in the schema at elixir/lib/symphony_elixir/config/schema.ex. You can modify this value at runtime using SymphonyElixir.Config.put_settings!/1, allowing you to tune detection sensitivity for different workload types without code changes.

What happens to the original task when Symphony restarts a stalled issue?

The original task is forcefully terminated via terminate_running_issue/3, which handles cleanup of associated resources and workspace state. The orchestrator then creates a new execution attempt through schedule_issue_retry/4, applying exponential back-off to prevent immediate retry loops and preserving error context describing the stall condition.
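To make the back-off behavior concrete, here is a minimal sketch of a capped exponential delay. The base delay, the cap, and the function name are hypothetical values chosen for illustration; the article does not state the constants Symphony actually uses.

```elixir
defmodule BackoffSketch do
  @moduledoc "Illustrative only; constants are assumptions, not Symphony's values."

  # Hypothetical base delay and upper bound, in milliseconds.
  @base_ms 1_000
  @max_ms 60_000

  # The delay doubles with each 1-based attempt and is capped at @max_ms.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 1 do
    min(@base_ms * Integer.pow(2, attempt - 1), @max_ms)
  end
end
```

With these assumed constants, attempts 1, 2, and 3 wait 1, 2, and 4 seconds, and later attempts level off at the 60-second cap, so a repeatedly stalling issue cannot monopolize dispatch.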
