Stale Pipeline Run Recovery Sweep

The Stale Pipeline Run Recovery Sweep is a nightly automated batch job that detects pipeline runs stuck in intermediate states, marks them as failed, and resets their in-progress features so the work is retried. It ensures the autonomous development loop never silently stalls due to transient failures or agent timeouts.

How It Works

Every night at 04:00 UTC, the sweep scans all active pipelineRuns for records that have been in an intermediate (non-terminal) status for more than 4 hours. When a stale run is found:

The pipelineRun is marked as failed with the error reason timeout.
Any associated features in the in_progress state are reset to found.
The unified pipeline loop picks up the reset features on its next cycle for a fresh attempt.
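
The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the platform's actual implementation: the `PipelineRun` and `Feature` classes, the `status_since` and `feature_ids` fields, and the `sweep` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple

STALE_AFTER = timedelta(hours=4)
TERMINAL = {"completed", "failed", "cancelled"}


@dataclass
class PipelineRun:
    id: str
    status: str
    status_since: datetime   # assumption: when the run entered its current status
    feature_ids: List[str]   # assumption: how a run links to its features
    error: Optional[str] = None


@dataclass
class Feature:
    id: str
    status: str


def sweep(runs: List[PipelineRun], features: Dict[str, Feature],
          now: datetime) -> Tuple[List[str], List[str]]:
    """Fail stale runs and reset their in_progress features.

    Returns (ids of runs marked failed, ids of features reset to found).
    """
    failed_ids: List[str] = []
    reset_ids: List[str] = []
    for run in runs:
        if run.status in TERMINAL:
            continue  # terminal runs are never stale
        if now - run.status_since <= STALE_AFTER:
            continue  # "more than 4 hours" means exactly 4 hours is not stale
        run.status = "failed"
        run.error = "timeout"
        failed_ids.append(run.id)
        for feature_id in run.feature_ids:
            feature = features.get(feature_id)
            if feature is not None and feature.status == "in_progress":
                feature.status = "found"
                reset_ids.append(feature_id)
    return failed_ids, reset_ids
```

Only features in the in_progress state are touched; features in any other state keep their status even when their run is failed.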

Stale Status Thresholds

The following intermediate statuses are monitored. A run is considered stale if it has remained in any of these states for more than 4 hours:

Status	Description
`queued`	Waiting to be picked up by an agent
`researching`	Research agent is gathering context
`architecting`	Architect agent is decomposing work
`implementing`	Engineer agents are writing code
`testing`	CI / test agents are verifying changes
`awaiting_approval`	Pending automated or human approval
`releasing`	Release pipeline is running
`marketing`	Marketing agents are preparing assets
`documenting`	Documentation agent is generating pages

Terminal statuses (completed, failed, cancelled) are never treated as stale.
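
As an illustration, the status lists above could be encoded as two sets with a small predicate. The names `MONITORED_STATUSES`, `TERMINAL_STATUSES`, and `is_stale` are assumptions made for this sketch, and rejecting unknown statuses with an error is a design choice of the example, not documented behavior.

```python
MONITORED_STATUSES = frozenset({
    "queued", "researching", "architecting", "implementing", "testing",
    "awaiting_approval", "releasing", "marketing", "documenting",
})
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


def is_stale(status: str, hours_in_status: float,
             threshold_hours: float = 4.0) -> bool:
    """Return True if a run in `status` for `hours_in_status` hours is stale."""
    if status in TERMINAL_STATUSES:
        return False  # terminal statuses are never treated as stale
    if status not in MONITORED_STATUSES:
        # Assumption: fail loudly on a status this table does not list.
        raise ValueError(f"unknown pipeline run status: {status!r}")
    return hours_in_status > threshold_hours
```

For example, `is_stale("testing", 4.5)` is true, while `is_stale("testing", 4.0)` and `is_stale("failed", 100)` are both false.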

Schedule

Cron: 0 4 * * *
Time: 04:00 UTC, daily
Type: nightly_batch
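
For reference, the cron expression 0 4 * * * fires at minute 0 of hour 4 every day. A minimal sketch of computing the next sweep time in UTC, using only the standard library (the function name `next_sweep` is hypothetical):

```python
from datetime import datetime, time, timedelta, timezone


def next_sweep(after: datetime) -> datetime:
    """Return the first 04:00 UTC instant strictly later than `after`."""
    if after.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    after_utc = after.astimezone(timezone.utc)
    candidate = datetime.combine(after_utc.date(), time(4, 0), tzinfo=timezone.utc)
    if candidate <= after_utc:
        candidate += timedelta(days=1)  # today's run has already passed
    return candidate
```

Because the schedule is expressed in UTC, it is unaffected by daylight saving changes in any local time zone.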

Affected Entities

pipelineRuns — Stale runs are transitioned to failed with error: timeout.
features — Associated in_progress features are reset to found for retry.

Retry Behavior

Resetting a feature to found returns it to the standard unified pipeline loop. The next loop cycle will re-queue the feature and begin a fresh pipeline run from the start. No data from the failed run is carried forward.
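
A rough sketch of what one loop cycle might look like from the retry perspective. Everything here is illustrative: the `loop_cycle` function, the `artifacts` field, and the assumption that pickup moves a feature to in_progress are not taken from the platform's code.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    id: str
    status: str


@dataclass
class PipelineRun:
    id: str
    feature_id: str
    status: str = "queued"
    # Assumption: per-run working data; a new run always starts empty.
    artifacts: Dict[str, str] = field(default_factory=dict)


def loop_cycle(features: List[Feature]) -> List[PipelineRun]:
    """Start a brand-new queued run for every feature in the found state."""
    new_runs: List[PipelineRun] = []
    for feature in features:
        if feature.status != "found":
            continue
        # New run ID and empty artifacts: nothing is copied from a failed run.
        new_runs.append(PipelineRun(id=uuid.uuid4().hex, feature_id=feature.id))
        feature.status = "in_progress"  # assumption: pickup marks the feature in_progress
    return new_runs
```

The point of the sketch is that the new run is constructed from the feature alone, so a retried feature starts at queued with no state inherited from the timed-out run.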

Observability

Each sweep execution produces log entries for:

The number of stale pipeline runs detected
The IDs of runs marked as failed
The number of features reset to found

These logs are available in the platform's operational observability dashboard.
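
As a sketch of what these entries might look like, the following uses Python's standard logging module. The logger name, message wording, and function names are assumptions for illustration, and the example assumes every detected stale run is also marked as failed.

```python
import logging
from typing import List

logger = logging.getLogger("stale_run_sweep")  # hypothetical logger name


def sweep_log_lines(failed_run_ids: List[str], features_reset: int) -> List[str]:
    """Build the three summary messages for one sweep execution."""
    return [
        f"stale pipeline runs detected: {len(failed_run_ids)}",
        f"runs marked as failed: {', '.join(failed_run_ids) or 'none'}",
        f"features reset to found: {features_reset}",
    ]


def log_sweep(failed_run_ids: List[str], features_reset: int) -> None:
    """Emit the summary messages at INFO level."""
    for line in sweep_log_lines(failed_run_ids, features_reset):
        logger.info(line)
```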