How Symphony Restarts Stalled Issues: Automatic Recovery in OpenAI's Orchestrator
Symphony’s orchestration layer continuously monitors running issues and automatically restarts any that exceed a configurable stall timeout, applying exponential back-off to ensure self-healing workflow execution.
The openai/symphony repository implements a robust orchestration engine in Elixir designed to manage long-running computational issues. When an issue stops reporting activity, Symphony treats it as stalled and triggers an automatic restart sequence to maintain pipeline health without manual intervention.
How Stalled Issue Detection Works
Stall detection operates on every orchestration cycle through the main reconciliation loop defined in elixir/lib/symphony_elixir/orchestrator.ex.
The Reconciliation Entry Point
The function reconcile_stalled_running_issues/1 (lines 48-66) serves as the primary entry point for stall detection. On each tick of the orchestrator, this function iterates over state.running to evaluate how long each issue has been idle. It retrieves the timeout threshold from Config.settings!().codex.stall_timeout_ms (defined in the config schema at lines 175-178) and compares it against the actual elapsed idle time. Issues exceeding this threshold are passed to restart_stalled_issue/5.
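The detection pass can be sketched roughly as follows. This is a minimal illustration, not the code in orchestrator.ex: the module name StallSketch.Reconcile, the function stalled_issue_ids/3, and the last_activity_at field are all hypothetical, and the running map is assumed to be keyed by issue id.

```elixir
defmodule StallSketch.Reconcile do
  # Hypothetical sketch: the shape of `running` and the field name
  # `last_activity_at` are assumptions, not Symphony's real data structures.

  @doc "Returns the sorted ids of running entries idle for longer than `timeout_ms`."
  def stalled_issue_ids(running, timeout_ms, %DateTime{} = now) when is_map(running) do
    running
    |> Enum.filter(fn {_id, entry} ->
      case idle_ms(entry, now) do
        # No usable timestamp: never treat the entry as stalled.
        nil -> false
        elapsed -> elapsed > timeout_ms
      end
    end)
    |> Enum.map(fn {id, _entry} -> id end)
    |> Enum.sort()
  end

  # Idle time in milliseconds, or nil when the entry has no timestamp.
  defp idle_ms(%{last_activity_at: %DateTime{} = ts}, now),
    do: DateTime.diff(now, ts, :millisecond)

  defp idle_ms(_entry, _now), do: nil
end
```

Passing the current time in as an argument, rather than calling DateTime.utc_now/0 inside the function, keeps the sketch deterministic and easy to test.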
Configuration: Setting the Stall Timeout
The stall timeout defaults to 5 minutes (300,000 ms) but is fully configurable through the Codex configuration schema. The orchestrator reads this value dynamically at runtime, allowing operators to tune sensitivity based on workload characteristics without redeploying code.
Calculating Idle Time for Running Issues
Before restarting, Symphony computes precise idle duration using stall_elapsed_ms/2 (lines 89-99).
This helper function examines the last_codex_timestamp field (or falls back to the issue start time) and calculates the difference in milliseconds using DateTime.diff/3. If no timestamp exists, the function returns nil, preventing false positives from triggering unnecessary restarts. This timestamp data originates from the tracker module, which records execution metrics for active issues.
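A minimal sketch of this calculation, assuming the running entry is a map, might look like the following. Only last_codex_timestamp comes from the description above; the module name and the started_at fallback field are hypothetical names chosen for illustration.

```elixir
defmodule StallSketch.Elapsed do
  # `started_at` is an assumed name for the start-time fallback; the real
  # field name in the orchestrator may differ.

  @doc "Idle time in milliseconds, or nil when no timestamp is available."
  def stall_elapsed_ms(entry, %DateTime{} = now) when is_map(entry) do
    case Map.get(entry, :last_codex_timestamp) || Map.get(entry, :started_at) do
      %DateTime{} = timestamp -> DateTime.diff(now, timestamp, :millisecond)
      # Missing or malformed timestamp: report nil so the caller skips the restart.
      _ -> nil
    end
  end
end
```

Returning nil instead of 0 or a large sentinel lets the caller distinguish "no data" from "not idle", which is what prevents false positives.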
The Restart Flow: From Detection to Recovery
Once an issue is confirmed stalled, restart_stalled_issue/5 (lines 67-84) executes a four-step recovery protocol:
- Log the stall event – A warning is emitted containing the issue identifier and elapsed stall duration.
- Determine retry attempt – next_retry_attempt_from_running/1 calculates the appropriate back-off level for the restart.
- Terminate the stuck task – terminate_running_issue/3 forcefully stops the original task and triggers necessary workspace cleanup.
- Re-queue with back-off – schedule_issue_retry/4 creates a new execution attempt with an error payload describing the stall reason and applies exponential back-off before the next dispatch.
This sequence ensures that stalled issues are cleanly isolated and re-inserted into the pipeline with appropriate delay mechanisms.
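The four steps can be sketched as a single function. This is an illustration under stated assumptions rather than the actual restart_stalled_issue/5: termination and scheduling are injected as anonymous functions instead of calling the real helpers, the retry_attempt field is a hypothetical name, and "previous attempt plus one" is an assumed rule for the next attempt.

```elixir
defmodule StallSketch.Restart do
  require Logger

  # Hypothetical sketch of the four-step flow. The real helpers take
  # different arguments; here they are passed in as functions.
  def restart_stalled(issue_id, entry, elapsed_ms, terminate_fun, schedule_fun) do
    # 1. Log the stall event with the issue id and idle duration.
    Logger.warning("Issue #{issue_id} stalled for #{elapsed_ms}ms; restarting")

    # 2. Determine the next retry attempt (assumed: previous attempt + 1).
    attempt = Map.get(entry, :retry_attempt, 0) + 1

    # 3. Terminate the stuck task; crash early if termination fails.
    :ok = terminate_fun.(issue_id)

    # 4. Re-queue with an error payload describing the stall reason.
    schedule_fun.(issue_id, attempt, %{error: "stalled for #{elapsed_ms}ms"})
  end
end
```

Terminating before re-queueing matters in this ordering: it avoids a window in which the old task and its replacement could both be running.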
Configuring the Stall Timeout in Practice
You can adjust the stall detection sensitivity at runtime by updating the configuration struct:
    # Reduce stall timeout from default 5 minutes to 2 minutes
    config = SymphonyElixir.Config.settings!()
    updated_codex = Map.put(config.codex, :stall_timeout_ms, 2 * 60_000)
    SymphonyElixir.Config.put_settings!(%{config | codex: updated_codex})
In production deployments, stall detection and restart run automatically within the orchestrator's main loop; invoking the underlying functions manually is typically unnecessary unless you are building custom monitoring tooling.
Key Source Files for Issue Restart Logic
Understanding the restart mechanism requires familiarity with these core modules:
- elixir/lib/symphony_elixir/orchestrator.ex – Contains reconcile_stalled_running_issues/1, stall_elapsed_ms/2, and restart_stalled_issue/5, implementing the complete detection and recovery flow.
- elixir/lib/symphony_elixir/config/schema.ex – Defines the stall_timeout_ms field (lines 175-178) used to configure detection sensitivity.
- elixir/lib/symphony_elixir/config.ex – Provides runtime access to configuration settings consumed by the orchestrator.
- elixir/lib/symphony_elixir/tracker.ex – Maintains execution timestamps including last_codex_timestamp used for idle time calculations.
- elixir/lib/symphony_elixir/workspace.ex – Handles resource cleanup operations invoked during task termination.
Summary
Symphony implements automatic recovery for stalled issues through a deterministic three-phase process:
- Detection occurs via reconcile_stalled_running_issues/1, which compares elapsed idle time against Config.settings!().codex.stall_timeout_ms every orchestration cycle.
- Calculation uses stall_elapsed_ms/2 to safely compute idle duration from last_codex_timestamp, preventing false restarts when data is missing.
- Recovery executes through restart_stalled_issue/5, which terminates the stuck task and re-queues it with exponential back-off and descriptive error context.
This architecture ensures that temporary hangs or silent failures do not block the orchestration pipeline indefinitely.
Frequently Asked Questions
What triggers a stalled issue restart in Symphony?
An issue restarts when the time elapsed since its last activity exceeds the stall_timeout_ms configuration value. The orchestrator evaluates this condition continuously through reconcile_stalled_running_issues/1, checking the last_codex_timestamp of every running issue against the current time.
How does Symphony calculate if an issue is stalled?
Symphony uses the stall_elapsed_ms/2 function to compute idle time. It takes the issue state and calculates the millisecond difference between the current time and the last_codex_timestamp (or the start time if no activity has been recorded). If neither timestamp is available, it returns nil and the issue is not restarted. If the computed duration exceeds the configured timeout, the issue qualifies for restart.
Can I configure the stall timeout duration?
Yes. The timeout is configurable through Config.settings!().codex.stall_timeout_ms, defined in the schema at elixir/lib/symphony_elixir/config/schema.ex. You can modify this value at runtime using SymphonyElixir.Config.put_settings!/1, allowing you to tune detection sensitivity for different workload types without code changes.
What happens to the original task when Symphony restarts a stalled issue?
The original task is forcefully terminated via terminate_running_issue/3, which handles cleanup of associated resources and workspace state. The orchestrator then creates a new execution attempt through schedule_issue_retry/4, applying exponential back-off to prevent immediate retry loops and preserving error context describing the stall condition.
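To illustrate what exponential back-off means here, the following sketch doubles the delay with each attempt and caps it at a maximum. The formula, the module name, and the base_ms and max_ms defaults are illustrative assumptions; the values and exact curve Symphony uses are not stated above.

```elixir
defmodule StallSketch.Backoff do
  # Illustrative defaults only: 10 s base delay, 5 min cap. These are
  # assumptions for the example, not Symphony's configured values.

  @doc "Delay before retry number `attempt` (1-based), doubling each time up to `max_ms`."
  def delay_ms(attempt, base_ms \\ 10_000, max_ms \\ 300_000)
      when is_integer(attempt) and attempt >= 1 do
    min(base_ms * Integer.pow(2, attempt - 1), max_ms)
  end
end
```

With these assumed defaults the first three retries wait 10, 20, and 40 seconds, and later attempts level off at the cap so that a persistently stalling issue is retried at a bounded interval.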