Behind the Fix: Retry Logic and Resilience for External Services (ERR-11)
Release: v0.1.112
Category: Error Resilience
The Problem
Compliance platforms depend on reliable delivery of alerts and data. When an external service returns a transient error — a 429 Too Many Requests from Twilio or a 503 Service Unavailable from the OFSI data feed — the correct behaviour is to wait briefly and try again. Before this release, the platform did not do that.
Specifically:
- Twilio SMS alerts would silently fail on any transient error. A rate limit or momentary outage meant a compliance alert was never delivered — with no log entry, no fallback, and no indication to the operator.
- The OFSI nightly sync had a URL-array fallback (it would try a list of mirror URLs in sequence), but this is not the same as retry-with-backoff. A temporary failure partway through a fetch was not retried.
- The Stripe API and SaaS Factory error ingest had no retry logic at all.
- runBatchRescreen — the batch re-screening engine — was called synchronously within the HTTP request lifecycle, making it impossible to retry independently of the originating request.
The practical risk: a transient Twilio outage at the moment a sanctions match is detected would silently swallow the SMS alert. The compliance team would not be notified.
What Was Fixed in This Release
Twilio SMS (src/lib/sms.ts)
An exponential backoff retry wrapper has been added to the Twilio SMS dispatch path. The behaviour is:
- Attempt the SMS send.
- On a transient failure (e.g. 429, 503), wait 1 second and retry.
- On a second failure, wait 2 seconds and retry.
- On a third failure, propagate the error to the caller.
This means up to 3 total attempts before an SMS is treated as definitively failed. The wrapper is transparent — callers of the SMS module do not need to change.
Attempt 1 → fail → wait 1s
Attempt 2 → fail → wait 2s
Attempt 3 → fail → throw error
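The retry schedule above can be sketched as a generic wrapper. This is a minimal illustration, not the actual src/lib/sms.ts code: the transient-error check and the parameter names here are assumptions.

```typescript
// Statuses treated as transient for illustration (rate limit, unavailable).
const TRANSIENT_STATUSES = new Set([429, 503]);

// Assumed error shape: Twilio-style errors carrying an HTTP status code.
function isTransient(err: unknown): boolean {
  return typeof err === "object" && err !== null && "status" in err &&
    TRANSIENT_STATUSES.has((err as { status: number }).status);
}

// Exponential backoff: 1s after the first failure, 2s after the second,
// then the error is propagated to the caller (3 total attempts by default).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the final attempt, or immediately on non-transient errors.
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap the existing send, e.g. `withRetry(() => sendSms(to, body))`, which is why no call-site changes are needed.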
This eliminates the most critical silent failure mode: a dropped sanctions-match SMS alert due to a momentary Twilio rate limit.
What Is Still Being Tracked
This release addresses the highest-priority gap (SMS alerts). The following services are flagged for follow-up remediation:
OFSI Nightly Sync
The current implementation cycles through a list of fallback URLs but does not implement true retry-with-backoff for a given URL. The recommended fix is to move the nightly sync into a background job queue — either Inngest or a GitHub Actions scheduled workflow with built-in retry — so transient fetch failures are automatically retried without manual intervention.
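For contrast, the distinction between mirror fallback and retry-with-backoff can be sketched as follows. This is an illustrative combination of the two, not the current sync code; the URL list and delays are placeholders.

```typescript
// Retry a single URL with exponential backoff before giving up on it.
async function fetchWithBackoff(
  url: string,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<string> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}

// The existing behaviour is the outer loop only: try each mirror once.
// Adding the inner backoff means a transient blip no longer burns a mirror.
async function fetchFromMirrors(urls: string[]): Promise<string> {
  let lastError: unknown;
  for (const url of urls) {
    try {
      return await fetchWithBackoff(url);
    } catch (err) {
      lastError = err; // Fall through to the next mirror.
    }
  }
  throw lastError;
}
```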
Batch Re-screening (runBatchRescreen)
The batch re-screening engine runs synchronously inside the HTTP request that triggers it. This architecture does not support retry: if the process fails partway through, there is no mechanism to resume or retry failed records independently. Moving this to a background queue (alongside the OFSI sync) would resolve both issues.
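The decoupling described above can be sketched with an in-memory queue. Every name here (JobQueue, handleRescreenRequest, rescreenRecord) is hypothetical; the real migration would target a durable queue such as Inngest rather than process memory.

```typescript
interface RescreenJob {
  recordIds: string[];
}

// Stand-in for a durable job queue, purely for illustration.
class JobQueue {
  private jobs: RescreenJob[] = [];
  enqueue(job: RescreenJob): void { this.jobs.push(job); }
  dequeue(): RescreenJob | undefined { return this.jobs.shift(); }
}

// HTTP handler: enqueue and acknowledge instead of running the batch inline,
// so the request returns immediately and the work can be retried later.
function handleRescreenRequest(queue: JobQueue, recordIds: string[]): number {
  queue.enqueue({ recordIds });
  return 202; // Accepted — processing happens asynchronously.
}

// Worker: each record is processed independently, so one failure no longer
// aborts the batch; failed IDs are collected for a later retry pass.
async function runWorker(
  queue: JobQueue,
  rescreenRecord: (id: string) => Promise<void>,
): Promise<string[]> {
  const failed: string[] = [];
  let job: RescreenJob | undefined;
  while ((job = queue.dequeue())) {
    for (const id of job.recordIds) {
      try {
        await rescreenRecord(id);
      } catch {
        failed.push(id);
      }
    }
  }
  return failed;
}
```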
Stripe API and Error Ingest
Neither the Stripe API calls nor the SaaS Factory error ingest pipeline currently have retry logic. These are lower urgency than SMS alerts but will be addressed in upcoming resilience work.
Recommendations for Operators
- Review SMS alert logs for gaps prior to v0.1.112. Any transient Twilio errors during that period would have resulted in undelivered alerts.
- Monitor the OFSI sync job manually until the background queue migration is in place. If the nightly sync fails, re-run it manually from the admin dashboard.
- No configuration changes are required to benefit from the new Twilio retry logic — it is active by default.
Summary
| Service | Status |
|---|---|
| Twilio SMS | ✅ Fixed — exponential backoff, 3 attempts |
| OFSI Nightly Sync | 🔄 Pending — background queue migration |
| Batch Re-screening | 🔄 Pending — background queue migration |
| Stripe API | 🔄 Pending |
| Error Ingest | 🔄 Pending |