Hardening External Service Calls: Retry Logic & Exponential Backoff (ERR-11)
Version: 0.1.129
Category: Error Resilience
The Problem
Compliance platforms must be reliable. But reliability isn't only about your own infrastructure — it's about how your system behaves when the external services you depend on have a bad moment.
Prior to this release, the platform made fire-and-forget calls to several critical external services:
- Twilio — SMS alerts for sanctions matches
- Stripe — Billing and subscription management
- SaaS Factory error ingest — Internal error reporting pipeline
- OFSI consolidated list endpoint — Nightly sanctions data sync
If any of these services returned a 429 Too Many Requests or 503 Service Unavailable, the call would fail immediately and silently. No retry. No alert. No log entry indicating the failure needed follow-up.
For SMS alerts in particular, this is a serious gap. A compliance team that doesn't receive a sanctions match notification — because Twilio was briefly rate-limiting at 2 a.m. — has a real operational problem.
What ERR-11 Fixes
Twilio SMS — Exponential Backoff Retry
The most critical fix in this release is in src/lib/sms.ts. Outbound SMS calls are now wrapped in a retry function with exponential backoff:
- Attempts: Up to 3 (1 initial + 2 retries)
- Delays: 1 s after the first failure, 2 s after the second
- Trigger: Any transient error (e.g. 429, 503)
This means a brief Twilio rate-limit will no longer silently drop a sanctions alert. The system will back off and try again before giving up and surfacing a hard error.
Attempt 1 → fails (429)
wait 1s
Attempt 2 → fails (503)
wait 2s
Attempt 3 → succeeds ✓
Only after all retries are exhausted is the failure propagated up and logged.
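The pattern can be sketched as a generic retry wrapper. This is a minimal illustration, not the actual `src/lib/sms.ts` code — the helper names (`withRetry`, `TransientError`) and the transient-error check are assumptions for the sake of the example:

```typescript
// Illustrative sketch of retry-with-exponential-backoff; helper names are
// hypothetical, not the actual src/lib/sms.ts implementation.
class TransientError extends Error {
  constructor(public status: number) {
    super(`Transient HTTP ${status}`);
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,     // 1 initial call + 2 retries
  baseDelayMs = 1000,  // 1 s after the first failure, doubling each time
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only transient failures are retried; anything else surfaces immediately.
      if (!(err instanceof TransientError) || attempt === maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // 1000 ms, then 2000 ms
    }
  }
  throw lastError; // unreachable; satisfies the type checker
}
```

The key design point is that non-transient errors (bad credentials, invalid phone number) are rethrown on the first attempt — retrying those would only delay the hard failure.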
OFSI Nightly Sync — Identified Gap
The nightly sync that pulls the OFSI consolidated list currently uses a URL-array fallback — if one mirror URL fails, it tries the next. This is not the same as retry-with-backoff for a single URL, and it does not handle transient failures like a temporary 503 from all mirrors simultaneously.
The recommended path forward is to move the nightly sync into a background job queue (such as Inngest or a scheduled GitHub Actions workflow with built-in retry). This ensures transient OFSI fetch failures are automatically retried without requiring manual intervention to re-trigger the sync.
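To make the gap concrete: mirror fallback and backoff are complementary, not interchangeable. A sketch of layering a backed-off repeat pass over the existing mirror loop (function and parameter names here are hypothetical, and the mirrors are abstracted as injectable fetchers rather than hard-coded URLs):

```typescript
// Illustrative only: try every mirror once per pass; if the whole pass fails
// (e.g. a temporary 503 from all mirrors), back off and repeat the pass.
type Fetcher = () => Promise<string>;

async function fetchWithMirrorsAndBackoff(
  mirrors: Fetcher[],
  maxPasses = 3,
  baseDelayMs = 1000,
): Promise<string> {
  for (let pass = 1; pass <= maxPasses; pass++) {
    for (const fetchMirror of mirrors) {
      try {
        return await fetchMirror();
      } catch {
        // This mirror failed; fall through to the next one.
      }
    }
    if (pass < maxPasses) {
      // Every mirror failed this pass: wait 1 s, then 2 s, before retrying.
      await new Promise<void>((r) => setTimeout(r, baseDelayMs * 2 ** (pass - 1)));
    }
  }
  throw new Error("All mirrors failed after all passes");
}
```

A background job queue achieves the same effect at a higher level — the whole sync run becomes the retried unit — which is why it remains the recommended fix.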
Batch Re-screening — Identified Gap
The runBatchRescreen engine is currently invoked synchronously within an HTTP request. This means:
- There is no retry capability if the batch job fails partway through
- Long-running rescreens may hit request timeout limits
- A transient downstream error aborts the entire batch with no recovery
Migrating runBatchRescreen to a background job queue is flagged for a follow-up release.
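One direction the follow-up could take — sketched here purely for illustration, not as the planned design — is splitting the batch into independent chunks so a transient failure retries only the affected chunk instead of aborting the entire run:

```typescript
// Illustrative sketch: chunked batch processing with per-chunk retry.
// A transient error retries just that chunk; earlier chunks are not redone.
async function processInChunks<T>(
  items: T[],
  chunkSize: number,
  processChunk: (chunk: T[]) => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    for (let attempt = 1; ; attempt++) {
      try {
        await processChunk(chunk);
        break; // chunk succeeded; move on to the next one
      } catch (err) {
        if (attempt >= maxAttempts) throw err; // give up on this chunk
      }
    }
  }
}
```

Note that for this to be safe, `processChunk` must be idempotent per chunk — a retried chunk must not double-flag or double-notify. That constraint applies equally to any background-queue migration.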
Why Silent Failures Are Especially Dangerous Here
Sanctions screening is a regulated compliance function. A missed SMS alert due to a silent Twilio failure isn't just a UX inconvenience — it could mean a compliance officer isn't notified of a flagged entity in time to act. Retry logic is a minimum baseline expectation for any system operating in this space.
What's Still Pending
| Service | Status |
|---|---|
| Twilio SMS | ✅ Fixed in v0.1.129 |
| OFSI Nightly Sync | ⚠️ Recommended: move to background job queue |
| Stripe API | ⚠️ No retry — flagged for follow-up |
| SaaS Factory error ingest | ⚠️ No retry — flagged for follow-up |
| runBatchRescreen | ⚠️ Synchronous, no retry — flagged for background queue migration |