Hardening External Service Calls: Retry Logic & Exponential Backoff (ERR-11)
Version: 0.1.129
Category: Error Resilience
The Problem
Compliance platforms must be reliable. But reliability isn't only about your own infrastructure — it's about how your system behaves when the external services you depend on have a bad moment.
Prior to this release, the platform made fire-and-forget calls to several critical external services:
- Twilio — SMS alerts for sanctions matches
- Stripe — Billing and subscription management
- SaaS Factory error ingest — Internal error reporting pipeline
- OFSI consolidated list endpoint — Nightly sanctions data sync
If any of these services returned a 429 Too Many Requests or 503 Service Unavailable, the call would fail immediately and silently. No retry. No alert. No log entry indicating the failure needed follow-up.
For SMS alerts in particular, this is a serious gap. A compliance team that doesn't receive a sanctions match notification — because Twilio was briefly rate-limiting at 2 a.m. — has a real operational problem.
What ERR-11 Fixes
Twilio SMS — Exponential Backoff Retry
The most critical fix in this release is in src/lib/sms.ts. Outbound SMS calls are now wrapped in a retry function with exponential backoff:
- Attempts: Up to 3 (1 initial + 2 retries)
- Delays: 1 s after the first failure, 2 s after the second
- Trigger: Any transient error (e.g. 429, 503)
This means a brief Twilio rate-limit will no longer silently drop a sanctions alert. The system will back off and try again before giving up and surfacing a hard error.
Attempt 1 → fails (429)
wait 1s
Attempt 2 → fails (503)
wait 2s
Attempt 3 → succeeds ✓
Only after all retries are exhausted is the failure propagated up and logged.
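The pattern can be sketched as a generic retry wrapper. This is a minimal illustration, not the actual `src/lib/sms.ts` code — the helper names (`withRetry`, `TransientError`) and the transient-error check are assumptions for the sake of the example:

```typescript
// Illustrative sketch of retry-with-exponential-backoff; helper names are
// hypothetical, not the actual src/lib/sms.ts implementation.
class TransientError extends Error {
  constructor(public status: number) {
    super(`Transient HTTP ${status}`);
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,     // 1 initial call + 2 retries
  baseDelayMs = 1000,  // 1 s after the first failure, doubling each time
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only transient failures are retried; anything else surfaces immediately.
      if (!(err instanceof TransientError) || attempt === maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // 1000 ms, then 2000 ms
    }
  }
  throw lastError; // unreachable; satisfies the type checker
}
```

The key design point is that non-transient errors (bad credentials, invalid phone number) are rethrown on the first attempt — retrying those would only delay the hard failure.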
OFSI Nightly Sync — Identified Gap
The nightly sync that pulls the OFSI consolidated list currently uses a URL-array fallback — if one mirror URL fails, it tries the next. This is not the same as retry-with-backoff for a single URL, and it does not handle transient failures like a temporary 503 from all mirrors simultaneously.
The recommended path forward is to move the nightly sync into a background job queue (such as Inngest or a scheduled GitHub Actions workflow with built-in retry). This ensures transient OFSI fetch failures are automatically retried without requiring manual intervention to re-trigger the sync.
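To make the gap concrete: mirror fallback and backoff are complementary, not interchangeable. A sketch of layering a backed-off repeat pass over the existing mirror loop (function and parameter names here are hypothetical, and the mirrors are abstracted as injectable fetchers rather than hard-coded URLs):

```typescript
// Illustrative only: try every mirror once per pass; if the whole pass fails
// (e.g. a temporary 503 from all mirrors), back off and repeat the pass.
type Fetcher = () => Promise<string>;

async function fetchWithMirrorsAndBackoff(
  mirrors: Fetcher[],
  maxPasses = 3,
  baseDelayMs = 1000,
): Promise<string> {
  for (let pass = 1; pass <= maxPasses; pass++) {
    for (const fetchMirror of mirrors) {
      try {
        return await fetchMirror();
      } catch {
        // This mirror failed; fall through to the next one.
      }
    }
    if (pass < maxPasses) {
      // Every mirror failed this pass: wait 1 s, then 2 s, before retrying.
      await new Promise<void>((r) => setTimeout(r, baseDelayMs * 2 ** (pass - 1)));
    }
  }
  throw new Error("All mirrors failed after all passes");
}
```

A background job queue achieves the same effect at a higher level — the whole sync run becomes the retried unit — which is why it remains the recommended fix.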
Batch Re-screening — Identified Gap
The runBatchRescreen engine is currently invoked synchronously within an HTTP request. This means:
- There is no retry capability if the batch job fails partway through
- Long-running rescreens may hit request timeout limits
- A transient downstream error aborts the entire batch with no recovery
Migrating runBatchRescreen to a background job queue is flagged for a follow-up release.
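One direction the follow-up could take — sketched here purely for illustration, not as the planned design — is splitting the batch into independent chunks so a transient failure retries only the affected chunk instead of aborting the entire run:

```typescript
// Illustrative sketch: chunked batch processing with per-chunk retry.
// A transient error retries just that chunk; earlier chunks are not redone.
async function processInChunks<T>(
  items: T[],
  chunkSize: number,
  processChunk: (chunk: T[]) => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    for (let attempt = 1; ; attempt++) {
      try {
        await processChunk(chunk);
        break; // chunk succeeded; move on to the next one
      } catch (err) {
        if (attempt >= maxAttempts) throw err; // give up on this chunk
      }
    }
  }
}
```

Note that for this to be safe, `processChunk` must be idempotent per chunk — a retried chunk must not double-flag or double-notify. That constraint applies equally to any background-queue migration.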
Why Silent Failures Are Especially Dangerous Here
Sanctions screening is a regulated compliance function. A missed SMS alert due to a silent Twilio failure isn't just a UX inconvenience — it could mean a compliance officer isn't notified of a flagged entity in time to act. Retry logic is a minimum baseline expectation for any system operating in this space.
What's Still Pending
| Service | Status |
|---|---|
| Twilio SMS | ✅ Fixed in v0.1.129 |
| OFSI Nightly Sync | ⚠️ Recommended: move to background job queue |
| Stripe API | ⚠️ No retry — flagged for follow-up |
| SaaS Factory error ingest | ⚠️ No retry — flagged for follow-up |
| runBatchRescreen | ⚠️ Synchronous, no retry — flagged for background queue migration |