# How Symphony's Stall Detection Mechanism Works: Timeout Detection and Recovery in OpenAI's Orchestrator

> Learn how Symphony's stall detection monitors worker timeouts and recovers hung issues through automatic rescheduling with exponential backoff, keeping OpenAI's orchestrator robust.

- Repository: [OpenAI/symphony](https://github.com/openai/symphony)
- Tags: internals
- Published: 2026-05-08

---

**Symphony's stall detection mechanism monitors the elapsed time since the last Codex activity for every running issue, terminating workers that exceed a configurable timeout and automatically rescheduling them with exponential backoff to prevent indefinite hangs.**

Symphony is OpenAI's Elixir-based orchestration framework designed to manage complex issue workflows. Its **stall detection mechanism**, implemented in the `Orchestrator` module, prevents runaway tasks from consuming resources indefinitely by forcefully terminating executions that show no Codex activity within a defined window and retrying them with progressive delays.

## How the Orchestrator Detects Stalls

According to the OpenAI Symphony source code, the stall detection logic resides in [`lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/orchestrator.ex). Each orchestration cycle invokes specific functions to evaluate whether running issues have become unresponsive.

### Configuration and Timeout Settings

The mechanism relies on the `stall_timeout_ms` parameter defined in [`lib/symphony_elixir/config/schema.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/config/schema.ex) (lines 176-182). By default, this value is set to `300_000` milliseconds (5 minutes). Runtime access occurs through `Config.settings!().codex.stall_timeout_ms`, allowing dynamic adjustment without recompiling the application.

To modify the timeout dynamically:

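```elixir
# Load the current configuration and override the stall timeout.
{:ok, config} = SymphonyElixir.Config.load()
new_cfg = put_in(config.codex.stall_timeout_ms, 60_000)  # 1-minute timeout

# Store the updated configuration in the application environment.
Application.put_env(:symphony_elixir, :config, new_cfg)
```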

### The Reconciliation Pass

During each orchestration cycle, the `reconcile_stalled_running_issues/1` function (lines 48-66) serves as the entry point for stall detection. This function short-circuits immediately if the timeout is configured as ≤ 0 or if no issues are currently running. Otherwise, it iterates over the `state.running` map to evaluate each active issue.
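The control flow described above can be sketched as follows. This is a minimal illustration, not the Symphony source: the module name `StallReconcileSketch`, the function `stalled_issue_ids/2`, and the precomputed `:idle_ms` field are all hypothetical, introduced only to show the two short-circuit conditions and the iteration over `state.running`.

```elixir
defmodule StallReconcileSketch do
  # Hypothetical stand-in for the reconciliation pass (not Symphony's code).
  # `state.running` is assumed to be a map of issue_id => entry, and each
  # entry is assumed to carry a precomputed `:idle_ms` integer.

  # Short-circuit 1: a timeout of zero or less disables detection.
  def stalled_issue_ids(_state, timeout_ms) when timeout_ms <= 0, do: []

  # Short-circuit 2: nothing is running, so there is nothing to check.
  def stalled_issue_ids(%{running: running}, _timeout_ms)
      when map_size(running) == 0,
      do: []

  # Otherwise, collect every issue whose idle time exceeds the timeout.
  def stalled_issue_ids(%{running: running}, timeout_ms) do
    for {issue_id, %{idle_ms: idle_ms}} <- running,
        idle_ms > timeout_ms,
        do: issue_id
  end
end
```

Returning a list of stalled IDs keeps the sketch side-effect free; the real function presumably acts on each stalled issue as it finds it.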

### Calculating Inactivity Duration

For every running issue, `stall_elapsed_ms/2` (lines 89-97) calculates the idle duration by comparing the current UTC time against the last activity timestamp. The function selects the timestamp from `:last_codex_timestamp` if present; otherwise, it falls back to `:started_at` (the moment the issue began execution). This ensures that stalls are detected based on the most recent Codex interaction rather than total runtime.
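A rough sketch of that timestamp fallback is shown below. It assumes each running entry is a map holding `DateTime` values, and it returns `nil` when neither timestamp is present; both the entry shape and the `nil` case are assumptions for illustration, as is the module name `StallElapsedSketch`.

```elixir
defmodule StallElapsedSketch do
  # Hypothetical re-creation of the idle-time calculation, not the actual
  # Symphony implementation. `entry` is assumed to be a map of DateTime values.
  def stall_elapsed_ms(entry, now \\ DateTime.utc_now()) do
    # Prefer the last Codex activity; fall back to when the issue started.
    case Map.get(entry, :last_codex_timestamp) || Map.get(entry, :started_at) do
      %DateTime{} = last_activity ->
        DateTime.diff(now, last_activity, :millisecond)

      _ ->
        # Assumed behavior: no usable timestamp means no measurable idle time.
        nil
    end
  end
end
```

Passing `now` explicitly makes the calculation easy to check with fixed timestamps instead of the wall clock.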

## Recovery Actions for Stalled Issues

When `stall_elapsed_ms/2` returns an integer greater than `stall_timeout_ms`, the check at lines 68-75 flags the issue as stalled and starts a two-step recovery: terminate the worker, then reschedule the issue.

### Terminating the Worker Process

The `terminate_running_issue/3` function (lines 15-44) immediately stops the stalled worker process. This cleanup prevents zombie processes from lingering when Codex activity has ceased, freeing up system resources and maintaining orchestrator health.
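The article does not show how `terminate_running_issue/3` stops the worker. One common approach on the BEAM, sketched here under that caveat, is to monitor the process, send it an untrappable `:kill` exit signal, and wait for the `:DOWN` message. The module `TerminateSketch` and the function `stop_worker/2` are hypothetical names.

```elixir
defmodule TerminateSketch do
  # Hypothetical helper, not Symphony's terminate_running_issue/3.
  # Stops a worker pid and waits until the runtime confirms it is down.
  def stop_worker(pid, wait_ms \\ 1_000) when is_pid(pid) do
    ref = Process.monitor(pid)

    # :kill cannot be trapped, so even an unresponsive process exits.
    Process.exit(pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      wait_ms ->
        # Clean up the monitor and drop any late :DOWN message.
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end
end
```

Waiting for `:DOWN` before rescheduling avoids a window in which the old worker and its replacement run at the same time.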

### Rescheduling with Exponential Backoff

Following termination, `schedule_issue_retry/4` (lines 78-84) reschedules the issue with a new error payload that records the specific stall duration ("stalled for … ms without codex activity"). The `next_retry_attempt_from_running/1` function calculates the next retry interval using exponential backoff, ensuring that repeated stalls progressively increase wait times before subsequent execution attempts.
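To make the backoff idea concrete, here is a minimal sketch of a capped exponential delay keyed on the attempt number. The base delay, the cap, and the module name `BackoffSketch` are illustrative assumptions; Symphony's actual constants and formula are not stated here.

```elixir
defmodule BackoffSketch do
  # Illustrative values only; these are not Symphony's real settings.
  @base_ms 10_000
  @max_ms 300_000

  # Delay doubles with each attempt (1 -> base, 2 -> 2x base, ...) up to a cap.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 1 do
    min(@base_ms * Integer.pow(2, attempt - 1), @max_ms)
  end
end
```

With these assumed values, attempts 1 through 5 wait 10, 20, 40, 80, and 160 seconds, and every later attempt waits the capped 300 seconds.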

When a stall is detected, the system logs a descriptive warning:

```elixir
Logger.warning(
  "Issue stalled: issue_id=42 issue_identifier=REQ-123 session_id=abcd1234 elapsed_ms=310000; restarting with backoff"
)
```

## Testing and Debugging Stall Detection

You can manually trigger a stall reconciliation pass for testing or debugging purposes outside the normal orchestration loop:

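```elixir
defmodule StallDebug do
  # Hypothetical helper module for manual testing.
  # Re-runs the stall reconciliation pass outside the normal loop.
  # Assumes reconcile_stalled_running_issues/1 is public; a private (defp)
  # function cannot be called from outside its module.
  def force_stall_check(state) do
    SymphonyElixir.Orchestrator.reconcile_stalled_running_issues(state)
  end
end
```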

## Summary

- Symphony's stall detection mechanism prevents indefinite execution by monitoring **Codex activity timestamps** in [`lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/orchestrator.ex).
- The default **stall timeout** is 300,000 ms (5 minutes), configurable via `Config.settings!().codex.stall_timeout_ms` in [`lib/symphony_elixir/config/schema.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/config/schema.ex).
- Detection occurs through **`reconcile_stalled_running_issues/1`**, which leverages **`stall_elapsed_ms/2`** to compare current UTC time against `:last_codex_timestamp` or `:started_at`.
- Stalled issues trigger **immediate termination** via `terminate_running_issue/3` followed by **rescheduling with exponential backoff** through `schedule_issue_retry/4` and `next_retry_attempt_from_running/1`.
- This architecture ensures failed or hanging Codex sessions do not block the orchestration pipeline indefinitely.

## Frequently Asked Questions

### How does Symphony determine if an issue has stalled?

Symphony calculates the elapsed milliseconds since the last Codex activity using `stall_elapsed_ms/2` in [`lib/symphony_elixir/orchestrator.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/orchestrator.ex). If the result exceeds the configured `stall_timeout_ms` (default 5 minutes), the orchestrator flags the issue as stalled and initiates recovery.

### Can I disable stall detection in Symphony?

Yes. Set the `stall_timeout_ms` configuration value to 0 or a negative integer. When `reconcile_stalled_running_issues/1` sees a timeout ≤ 0, it short-circuits and skips stall detection for that cycle, so issues can run indefinitely without intervention.

### What happens to the original worker process when a stall is detected?

The orchestrator calls `terminate_running_issue/3` to immediately stop the original worker process (lines 15-44). This cleanup prevents resource leaks and ensures the issue can be safely rescheduled with a fresh execution context rather than attempting to recover a frozen state.

### Where is the stall timeout configured in Symphony?

The timeout is defined in [`lib/symphony_elixir/config/schema.ex`](https://github.com/openai/symphony/blob/main/lib/symphony_elixir/config/schema.ex) (lines 176-182) with a default of 300,000 ms. Runtime access occurs through `Config.settings!().codex.stall_timeout_ms`, allowing dynamic adjustments via `Application.put_env/3` without requiring application recompilation or restarts.