Features · SaaS Factory · Updated February 19, 2026

Never Lose Work Again — Introducing the Stale Pipeline Run Recovery Sweep

v1.0.57 · Nightly Batch · Shipped

A fully autonomous development platform is only as reliable as its ability to recover from failure. Agents crash. Infrastructure hiccups. Networks blip. When that happens mid-pipeline, the worst outcome isn't a failed run — it's a run that looks like it's still working when it isn't.

With v1.0.57, we've shipped the Stale Pipeline Run Recovery Sweep: a nightly batch job that hunts down stuck pipeline runs and brings them back to life.

The Problem: Silent Stalls

The SaaS Factory pipeline is long. A single feature travels through research, architecture, implementation, testing, approval, release, marketing, and documentation before it's done. Each stage is handled by a different agent, and each handoff is a potential failure point.

In most cases, failures are loud — an agent returns an error, CI fails, a webhook times out. But sometimes a run just... stops. The status shows implementing or testing or awaiting_approval, and nothing moves. The feature is in limbo, holding a lock that prevents the unified loop from retrying it.

Left undetected, these stale runs accumulate silently, blocking features from shipping.

The Solution: A 4-Hour Timeout + Nightly Reset

The Recovery Sweep runs every night at 04:00 UTC. It queries every pipeline run currently sitting in an intermediate (non-terminal) status and checks one simple thing: has this run been in this state for more than 4 hours?

If yes:

  1. The pipeline run is marked failed with error: timeout — making it visible in dashboards and logs.
  2. Every feature that was in_progress inside that run is reset to found — putting it back in the queue for a fresh attempt.

The unified loop picks up the reset features on its next cycle. No manual intervention. No lost work. No human bottleneck.
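The two-step recovery above can be sketched in a few lines. This is an illustrative sketch, not the platform's actual code: the dict-shaped records, field names (`status_changed_at`, `features`), and the `sweep` function are all assumptions standing in for the real schema.

```python
from datetime import datetime, timedelta, timezone

STALE_TIMEOUT = timedelta(hours=4)

# The nine intermediate statuses the sweep monitors (from the post).
STALE_STATUSES = {
    "queued", "researching", "architecting", "implementing", "testing",
    "awaiting_approval", "releasing", "marketing", "documenting",
}

def sweep(runs, now=None):
    """Mark stale runs failed and reset their in-progress features.

    Returns the ids of recovered runs. `runs` is a list of dicts with
    hypothetical fields standing in for the real pipeline-run records.
    """
    now = now or datetime.now(timezone.utc)
    recovered = []
    for run in runs:
        if run["status"] not in STALE_STATUSES:
            continue  # terminal runs (completed, failed, cancelled) are never touched
        if now - run["status_changed_at"] < STALE_TIMEOUT:
            continue  # still inside the 4-hour window: leave it alone
        # Step 1: make the stall visible in dashboards and logs.
        run["status"] = "failed"
        run["error"] = "timeout"
        # Step 2: put the run's in-progress features back in the queue.
        for feature in run["features"]:
            if feature["status"] == "in_progress":
                feature["status"] = "found"
        recovered.append(run["id"])
    return recovered
```

Because the sweep only flips statuses, the unified loop needs no special recovery mode: reset features simply look newly `found` on its next cycle.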

What Counts as "Stuck"?

The sweep monitors nine intermediate statuses across the full pipeline lifecycle:

  • queued
  • researching
  • architecting
  • implementing
  • testing
  • awaiting_approval
  • releasing
  • marketing
  • documenting

Terminal states (completed, failed, cancelled) are never touched — they're already resolved.
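The intermediate/terminal split is the sweep's only eligibility rule, so it reduces to a set-membership check. A minimal sketch, with `is_sweepable` as a hypothetical helper name:

```python
# The nine intermediate statuses listed above, plus the terminal ones.
INTERMEDIATE = (
    "queued", "researching", "architecting", "implementing", "testing",
    "awaiting_approval", "releasing", "marketing", "documenting",
)
TERMINAL = ("completed", "failed", "cancelled")

def is_sweepable(status: str) -> bool:
    """Only intermediate statuses are eligible for the recovery sweep."""
    return status in INTERMEDIATE
```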

Why 4 Hours?

Four hours is long enough to accommodate legitimate long-running stages — a large implementation batch, a slow CI run, a multi-step release — while being short enough to surface genuine stalls within a single business day. The nightly 04:00 UTC schedule means most stale runs are cleared before the next active working period begins.
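As a config fragment, the nightly schedule is a single cron line. This is an assumed setup: the `recovery_sweep` binary, its `--timeout-hours` flag, and a UTC host clock are all hypothetical, illustrating the 04:00 UTC cadence described above.

```shell
# Hypothetical crontab entry: run the recovery sweep nightly at 04:00 UTC.
0 4 * * * /usr/local/bin/recovery_sweep --timeout-hours 4
```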

Self-Healing by Design

This sweep is a direct expression of the platform's core design principle: no human should be the bottleneck. When the system encounters a failure it can't immediately resolve, it doesn't wait for someone to notice. It cleans up, resets, and tries again — automatically, on a schedule, every single night.

With v1.0.57, stuck pipelines are no longer a silent problem. They're a temporary condition that fixes itself by morning.