Org Database Provisioning Timeout Watchdog

Released in: v0.1.258
Type: Maintenance workflow
Schedule: Hourly (0 * * * *)

Overview

The Organisation Database Provisioning Timeout Watchdog is an automated background workflow that detects organisation databases stuck in a provisioning or migrating state and notifies platform administrators so they can intervene before new organisations are impacted.

How it works

Every hour, the watchdog runs the following logic:

1. Identify stalled jobs. Query organization_databases for any record where:
   - status is 'provisioning' or 'migrating', and
   - provisionedAt is older than 30 minutes from the current time.
2. Notify administrators. For each stalled record, insert an error-level notification for every user with role = 'admin'.
3. No auto-remediation. The watchdog only alerts; it does not attempt to restart, cancel, or modify the stalled job. A human administrator is expected to investigate and resolve.
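
The sweep described above can be sketched as follows. This is a minimal illustration, not the actual workflow code: it assumes an SQLite database, that provisionedAt is stored as Unix epoch seconds, and that notifications has the hypothetical columns userId, organizationDatabaseId, and level. The 'active' status and 'member' role in the demo data are also placeholders.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(minutes=30)


def run_sweep(conn: sqlite3.Connection, now: datetime) -> int:
    """Insert one error notification per (stalled database, admin) pair."""
    # Assumption: provisionedAt holds Unix epoch seconds.
    cutoff = int((now - STALL_THRESHOLD).timestamp())
    stalled = conn.execute(
        "SELECT id FROM organization_databases "
        "WHERE status IN ('provisioning', 'migrating') AND provisionedAt < ?",
        (cutoff,),
    ).fetchall()
    admins = conn.execute("SELECT id FROM users WHERE role = 'admin'").fetchall()
    rows = [
        (admin_id, db_id, "error")
        for (db_id,) in stalled
        for (admin_id,) in admins
    ]
    # Alert only: the stalled organization_databases rows are never modified.
    conn.executemany(
        "INSERT INTO notifications (userId, organizationDatabaseId, level) "
        "VALUES (?, ?, ?)",
        rows,
    )
    return len(rows)


def demo() -> int:
    """Build a tiny in-memory dataset and run one sweep against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organization_databases (
            id INTEGER PRIMARY KEY, status TEXT, provisionedAt INTEGER);
        CREATE TABLE users (id INTEGER PRIMARY KEY, role TEXT);
        CREATE TABLE notifications (
            userId INTEGER, organizationDatabaseId INTEGER, level TEXT);
        """
    )
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

    def minutes_ago(minutes: int) -> int:
        return int((now - timedelta(minutes=minutes)).timestamp())

    conn.executemany(
        "INSERT INTO organization_databases VALUES (?, ?, ?)",
        [
            (1, "provisioning", minutes_ago(45)),  # past the threshold: stalled
            (2, "migrating", minutes_ago(10)),     # still within the threshold
            (3, "active", minutes_ago(120)),       # hypothetical finished status
        ],
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "admin"), (2, "admin"), (3, "member")],
    )
    return run_sweep(conn, now)
```

In the demo data, one database is stalled and two users are admins, so a single sweep inserts two notifications.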

Entities involved

Entity	Role
`organization_databases`	Source of truth for provisioning/migration status
`organizations`	Parent entity associated with the stalled database
`users`	Filtered to `role = 'admin'` as notification recipients
`notifications`	Destination for inserted error alerts

Trigger

Cron: 0 * * * *

The workflow runs at the top of every hour. A notification is generated only once a job has exceeded the 30-minute threshold and the next hourly sweep has run. The worst case is a job that crosses the threshold just after a sweep: it is picked up by the following sweep about 60 minutes later, so the maximum time from job start to notification is approximately 90 minutes.
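
The latency arithmetic can be checked with a short sketch. It assumes sweeps fire exactly on the hour and that "older than 30 minutes" is a strict comparison, so a sweep that coincides with the threshold crossing does not match; first_detecting_sweep is a hypothetical helper, not part of the workflow.

```python
from datetime import datetime, timedelta

STALL_THRESHOLD = timedelta(minutes=30)


def first_detecting_sweep(started_at: datetime) -> datetime:
    """Return the first top-of-hour sweep that sees the job as stalled."""
    crosses_at = started_at + STALL_THRESHOLD
    top_of_hour = crosses_at.replace(minute=0, second=0, microsecond=0)
    # The sweep must run strictly after the crossing time, so the earliest
    # matching sweep is always the next top of the hour.
    return top_of_hour + timedelta(hours=1)


# Worst case: the job crosses the threshold exactly at a sweep.
worst_start = datetime(2024, 1, 1, 0, 30, 0)
worst_latency = first_detecting_sweep(worst_start) - worst_start

# Best case: the job crosses the threshold one second before a sweep.
best_start = datetime(2024, 1, 1, 0, 29, 59)
best_latency = first_detecting_sweep(best_start) - best_start
```

Under these assumptions, worst_latency is 90 minutes and best_latency is 30 minutes and 1 second.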

Notification format

Notifications are error-level and target all platform admins. Each notification is associated with the specific organization_database record that is stalled, allowing admins to identify which organisation is affected.

Why 30 minutes?

A 30-minute threshold is used as the baseline for a "stalled" job because routine provisioning and migration operations are expected to complete well within that window. Jobs that exceed 30 minutes are almost certainly deadlocked, failed, or waiting on an unresponsive dependency.

Responding to a watchdog alert

When a platform admin receives a watchdog notification, the recommended investigation steps are:

1. Identify the affected organization_database record by ID.
2. Check the associated provisioning or migration logs for errors.
3. Determine whether the job can be safely retried or must be manually resolved.
4. Update the organization_database status once the issue is resolved to prevent repeated notifications.

Limitations

- The watchdog does not deduplicate notifications: if a job remains stalled across multiple hourly sweeps, a new notification is generated each hour.
- Only users with role = 'admin' receive alerts. Organisation-level users are not notified.
- No auto-remediation is performed; manual intervention is always required.
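
Because there is no deduplication, the number of alerts grows with both the number of sweeps and the number of admins. The following is a rough estimate only, assuming sweeps run exactly on the hour, the age comparison is strict, and the job stops matching as soon as its status is updated; alerts_for_stalled_job is a hypothetical helper.

```python
from datetime import datetime, timedelta


def alerts_for_stalled_job(
    started_at: datetime, resolved_at: datetime, admin_count: int
) -> int:
    """Estimate total notifications for one job that stalls until resolved_at."""
    crosses_at = started_at + timedelta(minutes=30)
    # First sweep that runs strictly after the threshold is crossed.
    sweep = crosses_at.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    sweeps = 0
    while sweep < resolved_at:
        sweeps += 1
        sweep += timedelta(hours=1)
    # Each matching sweep notifies every admin again.
    return sweeps * admin_count


# A job started at 09:10 and resolved at 13:20 is seen by the 10:00, 11:00,
# 12:00 and 13:00 sweeps; with 3 admins that is 4 * 3 notifications.
total = alerts_for_stalled_job(
    datetime(2024, 1, 1, 9, 10), datetime(2024, 1, 1, 13, 20), admin_count=3
)
```

Here total is 12, which illustrates why updating the record's status promptly keeps the notification volume down.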