AI Assessment Stuck-Running Watchdog

Available from: v0.1.228

Overview

The AI Assessment Watchdog is a nightly automated job that detects and recovers AI deduction assessments that have become stuck in a running state. It ensures agents are never left waiting indefinitely on an assessment that has silently failed.

How it works

Schedule

The watchdog runs on a cron schedule every night at 03:00 UTC (0 3 * * *).
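
As an illustration of what the `0 3 * * *` schedule means in practice, the following minimal sketch computes the next sweep time from a given moment. The `nextSweepAt` helper is hypothetical and is not part of the watchdog itself; it only mirrors the 03:00 UTC schedule described above.

```typescript
// Hypothetical helper (not from the product): returns the next 03:00 UTC
// sweep strictly after `now`, mirroring the cron expression `0 3 * * *`.
function nextSweepAt(now: Date): Date {
  // Start with 03:00 UTC on the same calendar day as `now`.
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), 3, 0, 0, 0),
  );
  // If that moment has already passed (or is exactly now), roll to tomorrow.
  if (next.getTime() <= now.getTime()) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}
```

Under this sketch, a call made at 02:00 UTC returns 03:00 UTC on the same day, while a call made at or after 03:00 UTC returns 03:00 UTC on the following day.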

Detection logic

On each run, the job queries ai_deduction_assessments for records matching both of the following conditions:

Field	Condition
`status`	`'running'`
`startedAt`	more than 10 minutes before the sweep time
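
The two conditions above can be sketched as a simple predicate. This is an in-memory illustration only: the `Assessment` shape, `isStuck`, `findStuck`, and `STUCK_THRESHOLD_MS` are assumed names, and the real job presumably expresses the same filter as a database query.

```typescript
// Assumed, simplified shape of an assessment record (illustration only).
interface Assessment {
  id: string;
  status: string;
  startedAt: Date;
  requestedById: string;
}

// 10-minute threshold from the detection rules, in milliseconds.
const STUCK_THRESHOLD_MS = 10 * 60 * 1000;

// True when the record is still 'running' and started more than
// 10 minutes before the sweep time.
function isStuck(a: Assessment, sweepTime: Date): boolean {
  const ageMs = sweepTime.getTime() - a.startedAt.getTime();
  return a.status === "running" && ageMs > STUCK_THRESHOLD_MS;
}

// Returns every record that satisfies both conditions.
function findStuck(all: Assessment[], sweepTime: Date): Assessment[] {
  return all.filter((a) => isStuck(a, sweepTime));
}
```

Note that the comparison is strict in this sketch: a record that started exactly 10 minutes before the sweep is not treated as stuck, matching the "more than 10 minutes" wording.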

Recovery actions

For each stuck assessment found, the job performs two operations atomically:

1. Marks the assessment as failed

Field	Value set
`status`	`'failed'`
`errorMessage`	`'Assessment timed out'`
`failedAt`	timestamp of the watchdog sweep

2. Creates an in-app notification

Recipient: the agent identified by requestedById on the assessment record
Severity: warning
Content: includes a direct link to retry the timed-out assessment
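
The two recovery operations can be sketched as a pure function that produces both writes for one stuck assessment. Everything named here is an assumption made for illustration: the `StuckAssessment`, `FailureUpdate`, and `WarningNotification` shapes, the `buildRecovery` function, the notification wording, and especially the retry URL format, which the source does not specify. A real implementation would apply both writes inside a single database transaction to get the atomicity described above.

```typescript
// Assumed minimal shapes (illustration only).
interface StuckAssessment {
  id: string;
  requestedById: string;
}

interface FailureUpdate {
  status: "failed";
  errorMessage: string;
  failedAt: Date;
}

interface WarningNotification {
  recipientId: string;
  severity: "warning";
  message: string;
  retryUrl: string;
}

// Builds the field update and the notification for one stuck assessment.
// Both results should be persisted together (for example, in one transaction).
function buildRecovery(
  a: StuckAssessment,
  sweepTime: Date,
): { update: FailureUpdate; notification: WarningNotification } {
  const update: FailureUpdate = {
    status: "failed",
    errorMessage: "Assessment timed out",
    failedAt: sweepTime, // timestamp of the watchdog sweep
  };
  const notification: WarningNotification = {
    recipientId: a.requestedById,
    severity: "warning",
    message: "Your AI deduction assessment timed out. You can retry it.", // hypothetical wording
    retryUrl: `/assessments/${a.id}/retry`, // hypothetical URL format
  };
  return { update, notification };
}
```

Keeping the construction of both records in one function makes it easier to guarantee that every failed assessment has exactly one matching notification.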

Agent experience

If one of your AI deduction assessments is caught by the watchdog, you will receive an in-app warning notification after the next nightly sweep at 03:00 UTC. Because the sweep runs once a night, the notification can arrive up to about a day after the assessment got stuck. The notification contains a retry link so you can re-trigger the assessment immediately without navigating away.

Assessments marked failed by the watchdog behave identically to any other failed assessment: they can be retried, and their status history is preserved for audit purposes.

Affected entities

aiDeductionAssessments — status, errorMessage, and failedAt fields updated
notifications — new warning notification record inserted per affected assessment

Frequently asked questions

Q: Why would an assessment get stuck in running?
AI assessments rely on external model inference. Upstream timeouts, infrastructure interruptions, or dropped connections can leave an assessment mid-run with no way to self-resolve. The watchdog provides a safety net for these edge cases.

Q: What is the 10-minute threshold based on?
Under normal conditions, AI deduction assessments complete well within 10 minutes. A running record older than 10 minutes is a reliable signal that the process did not complete successfully.

Q: Will I be notified for every stuck assessment?
Yes. One in-app notification is created per affected assessment, linked to the specific assessment so you can act on each one individually.

Q: Is there any data loss when an assessment is timed out?
No. The assessment record is preserved with its full history. Only the status fields are updated to reflect the failure. You can retry the assessment to generate a new result.