Org Database Provisioning Timeout Watchdog
Released in: v0.1.258
Type: Maintenance workflow
Schedule: Hourly (0 * * * *)
Overview
The Organisation Database Provisioning Timeout Watchdog is an automated background workflow that detects organisation databases stuck in a provisioning or migrating state and notifies platform administrators so they can intervene before new organisations are impacted.
How it works
Every hour, the watchdog runs the following logic:
- Identify stalled jobs: Query organization_databases for any record where status is 'provisioning' or 'migrating', and provisionedAt is older than 30 minutes from the current time.
- Notify administrators: For each stalled record, insert an error-level notification for every user with role = 'admin'.
- No auto-remediation: The watchdog only alerts; it does not attempt to restart, cancel, or modify the stalled job. A human administrator is expected to investigate and resolve it.
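The steps above can be sketched as two pure functions. This is a minimal illustration, not the workflow's actual implementation: the row shapes, the `id`, `organizationId`, `userId`, and `message` fields, and the function names `findStalled` and `buildNotifications` are all assumptions; only `status`, `provisionedAt`, `role`, and the 30-minute threshold come from the description.

```typescript
// Hypothetical row shapes. Only status, provisionedAt and role are taken
// from the description above; every other field is an assumption.
interface OrganizationDatabase {
  id: string;
  organizationId: string;
  status: string;
  provisionedAt: Date;
}

interface User {
  id: string;
  role: string;
}

interface WatchdogNotification {
  userId: string;
  level: "error";
  organizationDatabaseId: string;
  message: string;
}

const STALL_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

// Step 1: records still provisioning or migrating whose provisionedAt is
// strictly older than the threshold relative to `now`.
function findStalled(records: OrganizationDatabase[], now: Date): OrganizationDatabase[] {
  const cutoff = now.getTime() - STALL_THRESHOLD_MS;
  return records.filter(
    (r) =>
      (r.status === "provisioning" || r.status === "migrating") &&
      r.provisionedAt.getTime() < cutoff
  );
}

// Step 2: one error-level notification per (stalled record, admin) pair.
// Step 3 is implicit: nothing here modifies the stalled records.
function buildNotifications(stalled: OrganizationDatabase[], users: User[]): WatchdogNotification[] {
  const admins = users.filter((u) => u.role === "admin");
  const out: WatchdogNotification[] = [];
  for (const db of stalled) {
    for (const admin of admins) {
      out.push({
        userId: admin.id,
        level: "error",
        organizationDatabaseId: db.id,
        message: `Database ${db.id} (organisation ${db.organizationId}) has been '${db.status}' for over 30 minutes`,
      });
    }
  }
  return out;
}
```

In a real deployment the filter would presumably be a database query and the notifications an insert into the notifications table; keeping both steps as pure functions here just makes the selection rule easy to read and test.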
Entities involved
| Entity | Role |
|---|---|
| organization_databases | Source of truth for provisioning/migration status |
| organizations | Parent entity associated with the stalled database |
| users | Filtered to role = 'admin' as notification recipients |
| notifications | Destination for inserted error alerts |
Trigger
Cron: 0 * * * *
The workflow runs at the top of every hour. A provisioning job must exceed the 30-minute threshold and then be picked up by the next hourly sweep before a notification is generated. In the worst case, a job that is just under 30 minutes old at one sweep is not flagged until the following sweep, so detection can lag provisionedAt by approximately 90 minutes. In the best case, a job crosses the threshold just before a sweep and is flagged after a little over 30 minutes.
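The latency bound can be checked with a little arithmetic. The sketch below assumes sweeps run at minutes 0, 60, 120, and so on, and that "older than 30 minutes" is a strict comparison; the function names are illustrative only.

```typescript
const THRESHOLD_MIN = 30;
const SWEEP_INTERVAL_MIN = 60;

// Minutes are counted from the top of some hour; sweeps run at 0, 60, 120, ...
// Returns the minute of the first sweep at which the job is strictly older
// than the threshold.
function firstDetectionMinute(startMinute: number): number {
  const thresholdCrossed = startMinute + THRESHOLD_MIN;
  // Smallest multiple of the sweep interval strictly greater than thresholdCrossed.
  return (Math.floor(thresholdCrossed / SWEEP_INTERVAL_MIN) + 1) * SWEEP_INTERVAL_MIN;
}

// Minutes between the job's start and the sweep that first flags it.
function detectionLatency(startMinute: number): number {
  return firstDetectionMinute(startMinute) - startMinute;
}
```

Under these assumptions, `detectionLatency(29)` is 31 (the job crosses the threshold one minute before a sweep), `detectionLatency(0)` is 60, and `detectionLatency(30)` is 90 (the job is exactly 30 minutes old at the first sweep, so it is only flagged an hour later).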
Notification format
Notifications are error-level and target all platform admins. Each notification is associated with the specific organization_database record that is stalled, allowing admins to identify which organisation is affected.
Why 30 minutes?
A 30-minute threshold is used as the baseline for a "stalled" job because routine provisioning and migration operations are expected to complete well within that window. Jobs that exceed 30 minutes are almost certainly deadlocked, failed, or waiting on an unresponsive dependency.
Responding to a watchdog alert
When a platform admin receives a watchdog notification, the recommended investigation steps are:
- Identify the affected organization_database record by ID.
- Check the associated provisioning or migration logs for errors.
- Determine whether the job can be safely retried or must be manually resolved.
- Update the organization_database status once the issue is resolved to prevent repeated notifications.
Limitations
- The watchdog does not deduplicate notifications: if a job remains stalled across multiple hourly sweeps, a new notification is generated each hour.
- Only users with role = 'admin' receive alerts. Organisation-level users are not notified.
- No auto-remediation is performed; manual intervention is always required.
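Because there is no deduplication, notification volume grows multiplicatively with the number of stalled records, admins, and sweeps. A rough estimate, using a hypothetical helper that is not part of the workflow:

```typescript
// Hypothetical helper: estimates how many notification rows are inserted
// while a set of records stays stalled, assuming every sweep notifies
// every admin about every stalled record.
function totalNotifications(
  stalledRecords: number,
  adminCount: number,
  sweepsWhileStalled: number
): number {
  return stalledRecords * adminCount * sweepsWhileStalled;
}
```

For example, a single record left stalled across 8 hourly sweeps with 5 admins would produce `totalNotifications(1, 5, 8)`, which is 40 notifications. This is why updating the record's status promptly after resolution matters.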