How Symphony's Stall Detection Mechanism Works: Timeout Detection and Recovery in OpenAI's Orchestrator
Symphony's stall detection mechanism monitors the elapsed time since the last Codex activity for every running issue, terminating workers that exceed a configurable timeout and automatically rescheduling them with exponential backoff to prevent indefinite hangs.
Symphony is OpenAI's Elixir-based orchestration framework designed to manage complex issue workflows. Its stall detection mechanism, implemented in the Orchestrator module, prevents runaway tasks from consuming resources indefinitely by forcefully terminating executions that show no Codex activity within a defined window and retrying them with progressive delays.
How the Orchestrator Detects Stalls
According to the OpenAI Symphony source code, the stall detection logic resides in lib/symphony_elixir/orchestrator.ex. Each orchestration cycle invokes specific functions to evaluate whether running issues have become unresponsive.
Configuration and Timeout Settings
The mechanism relies on the stall_timeout_ms parameter defined in lib/symphony_elixir/config/schema.ex (lines 176-182). By default, this value is set to 300_000 milliseconds (5 minutes). Runtime access occurs through Config.settings!().codex.stall_timeout_ms, allowing dynamic adjustment without recompiling the application.
To modify the timeout dynamically:
{:ok, config} = SymphonyElixir.Config.load()
new_cfg = put_in(config.codex.stall_timeout_ms, 60_000) # 1-minute timeout
Application.put_env(:symphony_elixir, :config, new_cfg)
The Reconciliation Pass
During each orchestration cycle, the reconcile_stalled_running_issues/1 function (lines 48-66) serves as the entry point for stall detection. This function short-circuits immediately if the timeout is configured as ≤ 0 or if no issues are currently running. Otherwise, it iterates over the state.running map to evaluate each active issue.
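The control flow described above can be sketched as follows. This is a simplified illustration, not the Symphony source: the module name, the `reconcile/2` function, and the precomputed `:idle_ms` field are all hypothetical stand-ins, and the real function presumably works from timestamps rather than a ready-made idle duration.

```elixir
defmodule StallReconcileSketch do
  # Hypothetical stand-in for the reconciliation pass; not the Symphony source.
  # `state.running` is assumed to be a map of issue_id => %{idle_ms: integer}.
  # Returns the sorted list of issue ids whose idle time exceeds the timeout.
  def reconcile(state, timeout_ms) do
    cond do
      # A timeout of zero or less disables stall detection entirely.
      timeout_ms <= 0 ->
        []

      # Nothing is running, so there is nothing to inspect.
      map_size(state.running) == 0 ->
        []

      true ->
        stalled =
          for {issue_id, entry} <- state.running,
              entry.idle_ms > timeout_ms,
              do: issue_id

        Enum.sort(stalled)
    end
  end
end
```

With a 300_000 ms timeout, an issue idle for 310_000 ms would be reported as stalled while one idle for 1_000 ms would not.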
Calculating Inactivity Duration
For every running issue, stall_elapsed_ms/2 (lines 89-97) calculates the idle duration by comparing the current UTC time against the last activity timestamp. The function selects the timestamp from :last_codex_timestamp if present; otherwise, it falls back to :started_at (the moment the issue began execution). This ensures that stalls are detected based on the most recent Codex interaction rather than total runtime.
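A minimal sketch of that timestamp fallback, assuming each running entry is a map holding `DateTime` values. The module name, the argument order, and the `nil` return for an entry with no usable timestamp are assumptions made for illustration; only the `:last_codex_timestamp` and `:started_at` keys come from the description above.

```elixir
defmodule StallElapsedSketch do
  # Hypothetical sketch of the idle-time calculation; not the Symphony source.
  # Prefers :last_codex_timestamp and falls back to :started_at.
  def stall_elapsed_ms(entry, now) do
    case Map.get(entry, :last_codex_timestamp) || Map.get(entry, :started_at) do
      # Elapsed time in milliseconds between the reference timestamp and now.
      %DateTime{} = timestamp -> DateTime.diff(now, timestamp, :millisecond)
      # Assumption: no usable timestamp means the idle time is unknown.
      _ -> nil
    end
  end
end
```

In a real pass, `now` would typically come from `DateTime.utc_now/0`; taking it as an argument keeps the function easy to test with fixed timestamps.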
Recovery Actions for Stalled Issues
When stall_elapsed_ms/2 returns an integer exceeding stall_timeout_ms (lines 68-75), the orchestrator flags the issue as stalled and initiates a two-phase recovery process.
Terminating the Worker Process
The terminate_running_issue/3 function (lines 15-44) immediately stops the stalled worker process. This cleanup prevents zombie processes from lingering when Codex activity has ceased, freeing up system resources and maintaining orchestrator health.
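How Symphony actually stops the worker is not shown here, so the following is only a generic illustration of forcefully stopping a process on the BEAM and confirming it is gone. The module and function names are hypothetical, and the real `terminate_running_issue/3` may use a supervisor or a different exit reason.

```elixir
defmodule StallTerminateSketch do
  # Hypothetical sketch: kills a worker pid and waits until it is confirmed down.
  def terminate(pid, timeout_ms \\ 1_000) do
    # Monitor first so the :DOWN message cannot be missed.
    ref = Process.monitor(pid)
    # :kill cannot be trapped, so a hung worker cannot ignore it.
    Process.exit(pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      timeout_ms ->
        # Drop the monitor and any late :DOWN message before giving up.
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end
end
```

Waiting for the `:DOWN` message before rescheduling avoids a window in which the old worker and its replacement both act on the same issue.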
Rescheduling with Exponential Backoff
Following termination, schedule_issue_retry/4 (lines 78-84) reschedules the issue with a new error payload that records the specific stall duration ("stalled for … ms without codex activity"). The next_retry_attempt_from_running/1 function calculates the next retry interval using exponential backoff, ensuring that repeated stalls progressively increase wait times before subsequent execution attempts.
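To illustrate what exponential backoff means in practice, here is a small sketch that doubles the delay on each attempt and caps it. The module name, the 10-second base, and the 5-minute cap are assumptions chosen for the example; they are not values taken from Symphony.

```elixir
defmodule StallBackoffSketch do
  # Hypothetical backoff calculation; base and cap are illustrative values only.
  # The delay doubles with each attempt and never exceeds max_ms.
  def delay_ms(attempt, base_ms \\ 10_000, max_ms \\ 300_000)
      when is_integer(attempt) and attempt >= 1 do
    min(base_ms * Integer.pow(2, attempt - 1), max_ms)
  end
end
```

Under these assumed values, attempts 1 through 3 would wait 10, 20, and 40 seconds, and the delay stops growing once it reaches the cap.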
When a stall is detected, the system logs a descriptive warning:
Logger.warning(
"Issue stalled: issue_id=42 issue_identifier=REQ-123 session_id=abcd1234 elapsed_ms=310000; restarting with backoff"
)
Testing and Debugging Stall Detection
You can manually trigger a stall reconciliation pass for testing or debugging purposes outside the normal orchestration loop:
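defmodule StallDebug do
  # Re-runs the stall reconciliation pass outside the normal loop.
  # Assumes reconcile_stalled_running_issues/1 is callable from outside
  # the Orchestrator module (declared with def rather than defp).
  def force_stall_check(state) do
    SymphonyElixir.Orchestrator.reconcile_stalled_running_issues(state)
  end
end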
Summary
- Symphony's stall detection mechanism prevents indefinite execution by monitoring Codex activity timestamps in lib/symphony_elixir/orchestrator.ex.
- The default stall timeout is 300,000 ms (5 minutes), configurable via Config.settings!().codex.stall_timeout_ms in lib/symphony_elixir/config/schema.ex.
- Detection occurs through reconcile_stalled_running_issues/1, which leverages stall_elapsed_ms/2 to compare current UTC time against :last_codex_timestamp or :started_at.
- Stalled issues trigger immediate termination via terminate_running_issue/3 followed by rescheduling with exponential backoff through schedule_issue_retry/4 and next_retry_attempt_from_running/1.
- This architecture ensures failed or hanging Codex sessions do not block the orchestration pipeline indefinitely.
Frequently Asked Questions
How does Symphony determine if an issue has stalled?
Symphony calculates the elapsed milliseconds since the last Codex activity using stall_elapsed_ms/2 in lib/symphony_elixir/orchestrator.ex, falling back to the issue's start time when no Codex activity has been recorded. If the result exceeds the configured stall_timeout_ms (default 5 minutes), the orchestrator flags the issue as stalled and initiates recovery.
Can I disable stall detection in Symphony?
Yes. Set the stall_timeout_ms configuration value to 0 or a negative integer. When reconcile_stalled_running_issues/1 sees a timeout ≤ 0, it short-circuits and skips stall detection entirely for that cycle, so issues can run indefinitely without intervention.
What happens to the original worker process when a stall is detected?
The orchestrator calls terminate_running_issue/3 to immediately stop the original worker process (lines 15-44). This cleanup prevents resource leaks and ensures the issue can be safely rescheduled with a fresh execution context rather than attempting to recover a frozen state.
Where is the stall timeout configured in Symphony?
The timeout is defined in lib/symphony_elixir/config/schema.ex (lines 176-182) with a default of 300,000 ms. Runtime access occurs through Config.settings!().codex.stall_timeout_ms, allowing dynamic adjustments via Application.put_env/3 without requiring application recompilation or restarts.