Error Rate Tracking — What We're Missing and How We'll Fix It
Release: v0.1.136 · Control: ERR-25 · Category: err_monitoring
Overview
As part of a systematic audit of the platform's observability controls, we identified a significant gap: the platform currently has no error rate metrics, counters, or dashboards.
This post explains what that means in practice, why it matters for a sanctions-screening platform, and what we plan to do about it.
What Exists Today
The codebase includes a captureError() utility that applies a client-side rate limiter of 5 errors per minute. The limiter exists to prevent log flooding and duplicate noise from rapid-fire errors. It is a spam filter, not a measurement tool.
It does not:
- Record how many errors occurred over a time window
- Distinguish between error types or severity levels
- Emit metrics that can be graphed or alerted on
- Retain historical data for trend analysis
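To make the contrast concrete, here is a minimal sketch of what a measurement tool would record that the rate limiter does not. Everything in it is hypothetical: `ErrorRateCounter`, `RecordedError`, and the error type strings are illustrative names, not part of the codebase, and this is not how captureError() works.

```typescript
// Hypothetical sketch: none of these names exist in the codebase.
type Severity = "warning" | "error" | "fatal";

interface RecordedError {
  type: string; // illustrative label, e.g. "sync_failure"
  severity: Severity;
  atMs: number; // epoch milliseconds
}

class ErrorRateCounter {
  private events: RecordedError[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record every error. A rate limiter would drop events above its cap instead.
  record(type: string, severity: Severity, atMs: number = Date.now()): void {
    this.events.push({ type, severity, atMs });
  }

  // Count errors per type inside the window that ends at nowMs.
  // Events older than the window are discarded as a side effect.
  countsByType(nowMs: number = Date.now()): Record<string, number> {
    const cutoff = nowMs - this.windowMs;
    this.events = this.events.filter((e) => e.atMs > cutoff);
    const counts: Record<string, number> = {};
    for (const e of this.events) {
      counts[e.type] = (counts[e.type] ?? 0) + 1;
    }
    return counts;
  }
}
```

An in-memory counter like this would probably not survive across serverless invocations, so it only illustrates the shape of the data. The remediation plan below relies on Sentry, structured logs, and the syncLog table instead.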
Why This Matters
For a compliance platform that runs scheduled OFSI list syncs and batch rescreening jobs, a rising error rate that goes undetected poses real risk:
| Scenario | Current Detectability |
|---|---|
| Screening failure rate increases after a dependency update | ❌ Not detectable |
| Nightly sync starts failing silently 30% of the time | ❌ Not detectable |
| Batch rescreen job produces partial results under load | ❌ Not detectable |
| A spike in fuzzy-match errors follows a schema change | ❌ Not detectable |
Without rate data, there is no baseline to alert against and no history to diagnose incidents from.
Remediation Plan
Three complementary fixes are planned, in dependency order:
1. Sentry Error Volume Charts
Prerequisite: ERR-22 (Sentry integration) must be completed first.
Once Sentry is integrated, its built-in error volume charts will provide immediate baseline visibility — error counts over time, grouped by issue type, with environment filtering. This is the fastest path to basic rate awareness.
2. Structured Logging in Nightly Sync and Batch Rescreen
The nightly OFSI sync and batch rescreen jobs will be updated to emit one structured log line per run with explicit fields, for example:
    {
      "job": "nightly_sync",
      "status": "completed",
      "success_count": 1842,
      "failure_count": 3,
      "duration_ms": 4721,
      "timestamp": "2025-01-15T02:00:04.123Z"
    }
Structured output makes these fields queryable from Vercel log drains without manual log parsing, enabling integration with any downstream log aggregation or alerting tool.
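A job could emit that line along the following lines. This is a sketch under assumptions: `JobSummary`, `buildJobSummary`, and `logJobSummary` are hypothetical names, and the "failed" status value is assumed, since only "completed" appears in the example above.

```typescript
// Hypothetical helper names; field names follow the example log line above.
interface JobSummary {
  job: string;
  status: "completed" | "failed"; // "failed" is an assumed value
  success_count: number;
  failure_count: number;
  duration_ms: number;
  timestamp: string; // ISO 8601, UTC
}

function buildJobSummary(
  job: string,
  status: JobSummary["status"],
  successCount: number,
  failureCount: number,
  startedAtMs: number,
  finishedAtMs: number,
): JobSummary {
  return {
    job,
    status,
    success_count: successCount,
    failure_count: failureCount,
    duration_ms: finishedAtMs - startedAtMs,
    timestamp: new Date(finishedAtMs).toISOString(),
  };
}

// One JSON object per line keeps every field individually queryable.
function logJobSummary(summary: JobSummary): void {
  console.log(JSON.stringify(summary));
}

// Usage: reproduce the example line from this section.
const finishedAt = Date.parse("2025-01-15T02:00:04.123Z");
const summary = buildJobSummary(
  "nightly_sync",
  "completed",
  1842,
  3,
  finishedAt - 4721,
  finishedAt,
);
logJobSummary(summary);
```

Writing a single line at the end of each run, rather than scattering counts across several messages, means a log query never has to join lines from the same job.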
3. /api/metrics Endpoint
A new HTTP endpoint will expose sync success rates sourced directly from the syncLog table:
    GET /api/metrics
This provides a lightweight, polling-friendly interface suitable for integration with external monitoring tools such as UptimeRobot, Grafana, or Datadog. The endpoint will surface recent sync job outcomes so that degradation trends are visible without requiring access to raw logs.
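The core of such an endpoint might be a small aggregation over recent syncLog rows. The sketch below is illustrative only: the syncLog schema is not shown in this document, so the `SyncLogRow` shape, the `SyncMetrics` response shape, and the `summarizeSyncLog` name are all assumptions.

```typescript
// Assumed row shape; the real syncLog columns may differ.
interface SyncLogRow {
  status: "completed" | "failed";
  finishedAt: string; // ISO 8601
}

// Assumed response shape for GET /api/metrics.
interface SyncMetrics {
  total: number;
  succeeded: number;
  failed: number;
  successRate: number | null; // null when there are no runs to report on
}

function summarizeSyncLog(rows: SyncLogRow[]): SyncMetrics {
  const total = rows.length;
  const succeeded = rows.filter((r) => r.status === "completed").length;
  return {
    total,
    succeeded,
    failed: total - succeeded,
    // Avoid dividing by zero when no sync has run in the queried period.
    successRate: total === 0 ? null : succeeded / total,
  };
}

// Usage: four recent runs, one of which failed.
const recentRuns: SyncLogRow[] = [
  { status: "completed", finishedAt: "2025-01-12T02:00:04.000Z" },
  { status: "completed", finishedAt: "2025-01-13T02:00:05.000Z" },
  { status: "failed", finishedAt: "2025-01-14T02:00:03.000Z" },
  { status: "completed", finishedAt: "2025-01-15T02:00:04.000Z" },
];
const metrics = summarizeSyncLog(recentRuns);
```

A route handler would serialize this object as JSON. Returning `null` rather than `0` for an empty window lets a monitoring tool distinguish "no syncs ran" from "every sync failed", which are different incidents.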
Status
| Item | Status |
|---|---|
| ERR-22 Sentry integration | Prerequisite — pending |
| Structured logging (sync + rescreen) | Planned |
| /api/metrics endpoint | Planned |
No application code was changed in this release. This document tracks the identified gap and the agreed remediation approach.
Related Controls
- ERR-22 — Sentry integration (prerequisite for error volume charts)
- ERR-25 — Error rate tracking (this control)