ERR-24: Critical Error Alerting — Closing the P1 Observability Gap
Version: v0.1.132
Category: Error Monitoring
Severity: P1 — No alerting infrastructure exists
Overview
An audit of the sanctions screening platform's observability stack (control ERR-24) found that no alerting infrastructure is in place for P1-level errors. This means that critical failures — including nightly OFSI sync failures, database outages, and Stripe webhook processing errors — currently produce no notifications to on-call engineers or operations teams.
This post documents the identified gaps and the recommended steps to close them.
Identified Gaps
1. Nightly Sanctions Sync — No Failure Notifications
The nightly sync GitHub Actions workflow (nightly-sync.yml) downloads and processes the OFSI consolidated sanctions list. If this workflow fails, no notification is sent. Compliance teams and engineers have no automated signal that the sanctions data may be stale.
2. Database Health Check — No Uptime Alerting
The /api/health endpoint returns HTTP 503 when the database is unavailable, but no uptime monitor polls it, so a prolonged database outage would go undetected unless an engineer happened to check manually.
3. Stripe Webhook Failures — Console Logging Only
Errors during Stripe webhook processing are written to the console log only. There is no exception capture and no alert triggered for payment-critical failures, making silent failures a real risk in production.
4. No Alerting Integrations Configured
None of the following integrations are currently configured:
- Slack webhook
- Email alerts
- PagerDuty
- Sentry alert rules
Recommended Remediation Steps
Step 1 — Configure Sentry Alert Rules
Add alert rules in your Sentry project for error rate spikes. This is also a prerequisite for ERR-22.
- Navigate to Sentry → Alerts → Create Alert Rule
- Set a threshold on error rate (e.g. >5 errors/min on any critical transaction)
- Route alerts to the appropriate Slack channel or PagerDuty service
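The alert condition itself is configured in the Sentry UI, but the threshold logic it applies can be illustrated with a small sketch. `shouldAlert` below is a hypothetical helper mirroring the ">5 errors/min" rule, not part of the Sentry SDK:

```typescript
// Sliding-window error-rate check mirroring the ">5 errors/min" threshold above.
// Illustration only: Sentry evaluates this condition server-side; this helper
// is not part of any Sentry API.
function shouldAlert(
  errorTimestampsMs: number[],
  nowMs: number,
  maxErrorsPerMinute = 5,
): boolean {
  const windowStart = nowMs - 60_000; // one-minute sliding window
  const recentErrors = errorTimestampsMs.filter((t) => t >= windowStart);
  return recentErrors.length > maxErrorsPerMinute;
}
```

Note that only errors inside the window count, so a slow trickle of old errors never trips the rule; a genuine burst does.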
Step 2 — Add Slack Webhook Notification to Nightly Sync
Update nightly-sync.yml to include a failure notification step:

```yaml
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": ":rotating_light: Nightly OFSI sanctions sync failed. Please investigate immediately."
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
Ensure SLACK_WEBHOOK_URL is configured as a GitHub Actions secret.
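To verify the webhook wiring outside CI, the same payload can be built and posted locally. `buildFailurePayload` is a hypothetical helper that mirrors the JSON in the workflow step; in CI the delivery is handled by slackapi/slack-github-action.

```typescript
// Build the same JSON payload the workflow step sends to Slack.
// Hypothetical helper for local verification only.
function buildFailurePayload(jobName: string): string {
  return JSON.stringify({
    text: `:rotating_light: ${jobName} failed. Please investigate immediately.`,
  });
}

// Example: POST the payload with fetch (Node 18+), reading the webhook URL
// from the environment rather than hard-coding it:
// await fetch(process.env.SLACK_WEBHOOK_URL!, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: buildFailurePayload('Nightly OFSI sanctions sync'),
// });
```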
Step 3 — Configure an Uptime Monitor on /api/health
Point an uptime monitoring service (e.g. Better Uptime, UptimeRobot) at your /api/health endpoint:
- Monitor type: HTTP(S)
- URL: https://<your-domain>/api/health
- Alert condition: Status code 503 or any non-200 response
- Escalation: Page on-call via PagerDuty or send a Slack notification
This ensures that database availability issues surface immediately rather than being discovered by end users.
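The status logic behind such a health endpoint can be sketched as a pure function, assuming a lightweight DB probe whose result is mapped to an HTTP status. `healthResponse` is a hypothetical helper, not the platform's actual handler:

```typescript
// Sketch of the /api/health status mapping: a DB probe result becomes an
// HTTP status, so the uptime monitor's "non-200" condition covers outages.
type HealthCheck = { status: number; body: { ok: boolean; db: 'up' | 'down' } };

function healthResponse(dbReachable: boolean): HealthCheck {
  return dbReachable
    ? { status: 200, body: { ok: true, db: 'up' } }
    : { status: 503, body: { ok: false, db: 'down' } }; // matches the monitor's alert condition
}
```

Keeping the probe cheap (e.g. a single trivial query) matters here, since the monitor will hit the endpoint every minute or so.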
Step 4 — Instrument Stripe Webhook Failures with captureException
In the Stripe webhook handler, replace or supplement console logging with Sentry exception capture, tagged at high severity:
```typescript
import * as Sentry from '@sentry/nextjs';

try {
  // ... Stripe webhook processing
} catch (error) {
  Sentry.withScope((scope) => {
    scope.setTag('severity', 'high');
    scope.setTag('subsystem', 'stripe-webhook');
    Sentry.captureException(error);
  });
  console.error('Stripe webhook processing failed:', error);
  return res.status(500).json({ error: 'Webhook processing failed' });
}
```
This ensures payment-critical failures are captured in Sentry and can trigger alert rules configured in Step 1.
Interim Guidance
Until all remediation steps are applied, operations teams should:
- Manually verify the nightly sync GitHub Actions workflow each morning.
- Periodically check /api/health during business hours.
- Review application logs for Stripe webhook errors after any payment activity.
Related Controls
| Control | Description | Status |
|---|---|---|
| ERR-24 | P1 error alerting infrastructure | ⚠️ Open — No alerting configured |
| ERR-22 | Sentry alert rules for error rate spikes | ⛔ Blocked by ERR-24 |
Note: This is a P1 issue. Until alerting is in place, silent failures across sanctions data freshness, database availability, and payment processing represent operational and compliance risk.