Resilience Gap: No Circuit Breaker for External Services (ERR-12)
Status: Documented finding — mitigations recommended for two of the three integrations; none yet implemented.
Severity: Medium
Affects: Twilio SMS alerting, Stripe billing, OFSI sanctions endpoint
What is a Circuit Breaker?
A circuit breaker is a runtime resilience pattern that monitors calls to an external service. When a threshold of consecutive failures is reached (the circuit "trips"), subsequent calls are short-circuited and fail fast — without actually hitting the failing service — until the circuit resets after a configured time window.
This prevents:
- Cascading failures across request threads
- Spam to a dead downstream service
- Log noise from repeated identical errors
- Increased latency caused by waiting for timeouts on every request
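The pattern can be sketched in a few lines. This is a minimal illustration only (class and error messages are invented for the example, not taken from this codebase): after a threshold of consecutive failures the breaker opens and callers fail fast, until a reset window elapses and the next call is allowed through as a trial.

```typescript
// Minimal circuit breaker sketch. Opens after `threshold` consecutive
// failures; while open, calls fail fast without touching the service.
// After `resetMs`, the next call is allowed through as a trial.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold: number, private resetMs: number) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error("circuit open: failing fast"); // short-circuit
      }
      this.openedAt = null; // reset window expired: permit a trial call
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err; // propagate the real failure to the caller
    }
  }
}
```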
Current State in This Platform
The platform integrates with three external services:
| Service | Purpose | Circuit Breaker? |
|---|---|---|
| Twilio | SMS alerts for sanctions matches | ❌ None |
| Stripe | Subscription billing | ⚠️ Static config guard only |
| OFSI endpoint | Nightly consolidated list sync | ❌ None |
Twilio
SMS alerts are dispatched on the request path when a sanctions screening match is detected. If the Twilio API is unavailable, every alert attempt will fail and be retried on each subsequent matching request. There is no failure counter, no trip threshold, and no suppression window. During a Twilio outage this produces repeated failed delivery attempts and error log entries for every screened entity that triggers an alert.
Stripe
The isStripeConfigured() helper validates that the required Stripe credentials are present in the environment at startup. This is a static configuration guard, not a runtime circuit breaker — it will not detect a Stripe API degradation that occurs after the service has started. However, given the non-critical, low-frequency nature of billing calls relative to the core screening path, this is considered acceptable at the current scale.
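To make the distinction concrete, a static guard of this kind amounts to a one-time presence check on credentials. The sketch below is illustrative only; the environment variable names are assumptions, not taken from the actual codebase.

```typescript
// Hypothetical static configuration guard in the style of isStripeConfigured().
// It verifies that credentials exist; it says nothing about Stripe's runtime
// health after startup, which is why it is not a circuit breaker.
function isStripeConfigured(env: Record<string, string | undefined>): boolean {
  // Variable names are assumed for illustration.
  return Boolean(env["STRIPE_SECRET_KEY"] && env["STRIPE_WEBHOOK_SECRET"]);
}
```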
OFSI Endpoint
The nightly sync job fetches the OFSI consolidated list on a scheduled basis. If the OFSI endpoint is unavailable, the sync will fail without back-off or circuit tripping. Successive nightly runs against an unavailable endpoint will each attempt a full fetch and log a failure.
Recommended Mitigations
1. Twilio — In-Memory Failure Counter (Recommended)
Implement a lightweight failure counter scoped to the Twilio SMS dispatch function. No third-party library is required.
Proposed behaviour:

    on each Twilio call failure:
        increment failure_count
        if this is the first failure in the current window: record its timestamp
    before each Twilio call:
        if failure_count >= 5 AND time since first failure <= 5 minutes:
            skip the call, log a single WARNING: "Twilio circuit open — suppressing SMS"
            return early
    on successful Twilio call:
        reset failure_count to 0
Storage options:
- In-memory: Simplest to implement; resets on process restart. Acceptable for a single-instance deployment.
- Redis-backed: Required if the application runs across multiple instances or uses a serverless/edge runtime where in-memory state is not shared.
Trip threshold: 5 consecutive failures within a 5-minute window.
Reset: Automatic after the window expires, or on next successful call.
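The behaviour above could be sketched as follows, assuming a single-instance deployment with in-memory state. Function names and the dependency-injected `send` callback are illustrative; only the threshold, window, and log message come from this recommendation.

```typescript
// In-memory failure counter for the Twilio dispatch path (sketch).
const TRIP_THRESHOLD = 5;          // consecutive failures before tripping
const WINDOW_MS = 5 * 60 * 1000;   // 5-minute failure window

let failureCount = 0;
let firstFailureAt = 0;
let suppressionLogged = false;

function circuitOpen(now = Date.now()): boolean {
  if (failureCount >= TRIP_THRESHOLD && now - firstFailureAt <= WINDOW_MS) {
    return true;
  }
  if (now - firstFailureAt > WINDOW_MS) {
    failureCount = 0;              // window expired: automatic reset
    suppressionLogged = false;
  }
  return false;
}

// Wraps the actual Twilio dispatch (passed in as `send` for illustration).
// Returns true if the SMS was sent, false if it failed or was suppressed.
async function sendSmsWithBreaker(send: () => Promise<void>): Promise<boolean> {
  if (circuitOpen()) {
    if (!suppressionLogged) {
      console.warn("Twilio circuit open — suppressing SMS"); // single WARNING
      suppressionLogged = true;
    }
    return false;                  // skip the call, return early
  }
  try {
    await send();
    failureCount = 0;              // success resets the counter
    suppressionLogged = false;
    return true;
  } catch {
    if (failureCount === 0) firstFailureAt = Date.now();
    failureCount += 1;
    return false;
  }
}
```

Because the state lives in module scope, it resets on process restart, which matches the in-memory storage option described above.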
2. Stripe — No Change Required
The existing isStripeConfigured() static guard is sufficient. Stripe calls are infrequent and not on the critical compliance screening path. If Stripe is unavailable at runtime, the failure is surfaced to the user as a standard error response.
3. OFSI Sync — Exponential Back-off (Recommended)
Rather than a circuit breaker (which is most useful on the request path), the nightly sync job should implement exponential back-off with a maximum retry cap. If the initial sync attempt fails, retry after 5 minutes, then 15 minutes, then 1 hour, before logging a final failure and alerting the operations team.
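A sketch of that retry schedule, with the sleep function injected so the delays are testable; the function name and the `fetchList` callback are illustrative, and only the delay values come from the recommendation above:

```typescript
// Exponential back-off for the nightly OFSI sync (sketch).
// Delays per the recommendation: 5 minutes, 15 minutes, 1 hour.
const RETRY_DELAYS_MS = [5 * 60_000, 15 * 60_000, 60 * 60_000];

// Returns true on a successful sync, false after all retries are exhausted.
async function syncWithBackoff(
  fetchList: () => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  // Initial attempt (no delay) plus one retry per back-off step.
  const delays = [0, ...RETRY_DELAYS_MS];
  for (const delay of delays) {
    if (delay > 0) await sleep(delay);
    try {
      await fetchList();
      return true; // sync succeeded
    } catch (err) {
      console.error(`OFSI sync attempt failed: ${err}`);
    }
  }
  // Final failure: this is where the operations team would be alerted.
  console.error("OFSI sync failed after all retries");
  return false;
}
```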
Why Not a Full Circuit Breaker Library?
Libraries such as opossum (Node.js) provide full circuit breaker semantics including half-open states, event emitters, and dashboards. For a compliance SaaS at this scale, that complexity is likely unnecessary overhead. The simple failure counter described above provides the core benefit — suppressing spam to a dead service — without adding a dependency or significant architectural complexity.
If the platform scales to high request volumes or a microservices topology, a proper circuit breaker library should be reconsidered.