Stale Pipeline Run Recovery Sweep
The Stale Pipeline Run Recovery Sweep is a nightly automated batch job that detects pipeline runs stuck in intermediate states and resets them for retry. It ensures the autonomous development loop never silently stalls due to transient failures or agent timeouts.
How It Works
Every night at 04:00 UTC, the sweep scans all active pipelineRuns for records that have been in a non-terminal intermediate status for more than 4 hours. When a stale run is found:
- The pipelineRun is marked as failed with the error reason timeout.
- Any associated features in the in_progress state are reset to found.
- The unified pipeline loop picks them up on its next cycle for a fresh attempt.
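The recovery steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the platform's implementation: the `PipelineRun` and `Feature` classes, the `pipeline_run_id` link field, and the `recover_stale_run` function are hypothetical names chosen for the example. Only the status values and the `timeout` error reason come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    id: str
    pipeline_run_id: str  # hypothetical field linking a feature to its run
    status: str


@dataclass
class PipelineRun:
    id: str
    status: str
    error: Optional[str] = None


def recover_stale_run(run: PipelineRun, features: list[Feature]) -> list[str]:
    """Apply the recovery steps to one run already identified as stale.

    Returns the IDs of the features that were reset to "found".
    """
    # Step 1: fail the run and record the timeout reason.
    run.status = "failed"
    run.error = "timeout"

    # Step 2: reset only this run's in_progress features.
    reset_ids = []
    for feature in features:
        if feature.pipeline_run_id == run.id and feature.status == "in_progress":
            feature.status = "found"
            reset_ids.append(feature.id)

    # Step 3 is implicit: features in "found" are picked up by the
    # unified pipeline loop on its next cycle.
    return reset_ids
```

Features that belong to other runs, or that are not in_progress, are left untouched in this sketch.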
Stale Status Thresholds
The following intermediate statuses are monitored. A run is considered stale if it has remained in any of these states for more than 4 hours:
| Status | Description |
|---|---|
| queued | Waiting to be picked up by an agent |
| researching | Research agent is gathering context |
| architecting | Architect agent is decomposing work |
| implementing | Engineer agents are writing code |
| testing | CI / test agents are verifying changes |
| awaiting_approval | Pending automated or human approval |
| releasing | Release pipeline is running |
| marketing | Marketing agents are preparing assets |
| documenting | Documentation agent is generating pages |
Terminal statuses (completed, failed, cancelled) are never treated as stale.
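The staleness rule can be expressed as a small predicate. The sketch below assumes each run records the time it entered its current status. The `status_since` parameter and the `is_stale` function name are hypothetical; the status names and the 4-hour threshold are taken from this page.

```python
from datetime import datetime, timedelta, timezone

# Intermediate statuses monitored by the sweep.
INTERMEDIATE_STATUSES = frozenset({
    "queued", "researching", "architecting", "implementing", "testing",
    "awaiting_approval", "releasing", "marketing", "documenting",
})
# Terminal statuses are never treated as stale.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})
STALE_AFTER = timedelta(hours=4)


def is_stale(status: str, status_since: datetime, now: datetime) -> bool:
    """Return True if a run has sat in a monitored status for more than 4 hours."""
    if status not in INTERMEDIATE_STATUSES:
        return False
    # "More than 4 hours" is read as a strict comparison (an assumption).
    return now - status_since > STALE_AFTER


now = datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)
print(is_stale("testing", now - timedelta(hours=5), now))
print(is_stale("failed", now - timedelta(hours=48), now))
```

With these inputs the first call reports a stale run and the second does not, because failed is a terminal status regardless of its age.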
Schedule
Cron: 0 4 * * *
Time: 04:00 UTC, daily
Type: nightly_batch
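To make the cron expression concrete, here is a small sketch that computes the next sweep time implied by `0 4 * * *` in UTC. The function name `next_sweep_after` is hypothetical, and the platform's scheduler may compute this differently.

```python
from datetime import datetime, time, timedelta, timezone

# 04:00 UTC, matching the cron expression "0 4 * * *".
SWEEP_TIME_UTC = time(hour=4, minute=0, tzinfo=timezone.utc)


def next_sweep_after(now: datetime) -> datetime:
    """Return the first 04:00 UTC strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC first.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), SWEEP_TIME_UTC)
    if candidate <= now_utc:
        # Today's sweep time has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

Because the sweep runs once per day and the threshold is 4 hours, a run that stalls shortly after midnight UTC may wait until the following night's sweep before it is reset.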
Affected Entities
- pipelineRuns: Stale runs are transitioned to failed with error: timeout.
- features: Associated in_progress features are reset to found for retry.
Retry Behavior
Resetting a feature to found re-enters it into the standard unified pipeline loop. The next loop cycle will re-queue the feature and begin a fresh pipeline run from the start. No data from the failed run is carried forward.
Observability
Each sweep execution produces log entries for:
- The number of stale pipeline runs detected
- The IDs of runs marked as failed
- The number of features reset to found
These logs are available in the platform's operational observability dashboard.
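As an illustration of what a per-sweep summary might contain, the sketch below builds the three values listed above and emits them through Python's standard `logging` module. The logger name, the dictionary keys, and the `log_sweep_summary` function are assumptions for this example; the actual log format is not specified on this page.

```python
import logging

logger = logging.getLogger("stale_run_sweep")  # hypothetical logger name


def log_sweep_summary(failed_run_ids: list[str], features_reset: int) -> dict:
    """Build and log a summary of one sweep execution.

    Assumes every stale run detected is also marked as failed, so the
    detected count equals the number of failed run IDs.
    """
    summary = {
        "stale_runs_detected": len(failed_run_ids),
        "failed_run_ids": list(failed_run_ids),
        "features_reset": features_reset,
    }
    logger.info("stale run sweep finished: %s", summary)
    return summary
```

Returning the summary as well as logging it keeps the function easy to check in tests without capturing log output.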