Exponential Backoff Retry (ERR-11)
As of v0.1.116, the platform automatically retries transient failures when communicating with external services. This means a momentary Twilio rate limit or a brief OFSI gov.uk outage will not silently drop an SMS alert or require you to manually re-trigger the nightly sync.
What is covered
| Integration | Trigger | Max attempts | Base delay | Cap |
|---|---|---|---|---|
| Twilio SMS | HTTP 429 or 5xx | 3 | 1 s | 10 s |
| OFSI page scrape | HTTP 429/5xx or network error | 3 | 2 s | — |
| OFSI CSV download | HTTP 429/5xx or network error | 3 | 2 s | 30 s |
Retry schedule
Each retry uses exponential backoff with ±20 % jitter to prevent thundering-herd effects:
Attempt 1 → immediate
Attempt 2 → base delay (e.g. ~1 s for SMS)
Attempt 3 → base × multiplier (e.g. ~2 s for SMS)
For Twilio SMS the total worst-case wait before giving up is approximately 3.6 seconds (1 s + 2 s, each stretched by the maximum 20 % jitter), well within Vercel's 10 s edge function timeout.
For OFSI CSV downloads each delay between attempts is capped at 30 seconds, allowing the nightly job to tolerate a briefly overloaded gov.uk CDN.
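The schedule above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the code in src/lib/retry.ts: the helper names nominalDelayMs, withJitter and worstCaseWaitMs are hypothetical, and it assumes the cap is applied before jitter.

```typescript
// Hypothetical helper: nominal delay (before jitter) that precedes a given attempt.
// `attempt` is 1-based; attempt 1 runs immediately.
function nominalDelayMs(
  attempt: number,
  baseDelayMs: number,
  backoffMultiplier: number,
  maxDelayMs: number = Infinity,
): number {
  if (attempt <= 1) return 0;
  const raw = baseDelayMs * Math.pow(backoffMultiplier, attempt - 2);
  return Math.min(raw, maxDelayMs); // assumption: cap applied before jitter
}

// Apply ±20 % jitter. `random` is injectable so results can be made deterministic.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  const factor = 0.8 + random() * 0.4; // uniform in [0.8, 1.2)
  return delayMs * factor;
}

// Upper bound on total waiting time: every delay stretched by the full +20 %.
function worstCaseWaitMs(
  retries: number,
  baseDelayMs: number,
  backoffMultiplier: number,
  maxDelayMs: number = Infinity,
): number {
  let total = 0;
  for (let attempt = 2; attempt <= retries; attempt++) {
    total += nominalDelayMs(attempt, baseDelayMs, backoffMultiplier, maxDelayMs) * 1.2;
  }
  return total;
}

// Twilio SMS row: 3 attempts, 1 s base, multiplier 2, 10 s cap.
const smsDelays = [1, 2, 3].map((n) => nominalDelayMs(n, 1000, 2, 10_000));
const smsWorstCase = worstCaseWaitMs(3, 1000, 2, 10_000);
console.log(smsDelays, smsWorstCase);
```

With only three attempts the cap is never reached for any row in the table; it would start to matter only if the attempt count were raised.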
Permanent errors are not retried
Not every failure is worth retrying. HTTP 4xx errors other than 429 (e.g. 400 Bad Request, 401 Unauthorised, 403 Forbidden, 404 Not Found) indicate a permanent problem with the request itself. These propagate immediately without consuming any retry budget, preserving fast failure for misconfiguration and credential issues.
The distinction is enforced via two explicit error classes:
- RetriableError: the retry loop will wait and try again.
- NonRetriableError: the retry loop rethrows immediately, with no delay.
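For illustration, two such classes could be declared as below. This is a sketch under assumptions: the real definitions in src/lib/retry.ts may carry extra fields, and the nextAction helper is hypothetical. It also treats anything that is not a RetriableError as a reason to rethrow, which ignores the default transient-error predicate described later.

```typescript
// Hypothetical shape of the two marker classes.
class RetriableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RetriableError";
  }
}

class NonRetriableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetriableError";
  }
}

// Hypothetical helper: decide what a retry loop does with a thrown value.
// Assumption: only RetriableError triggers another attempt.
function nextAction(err: unknown): "retry" | "rethrow" {
  return err instanceof RetriableError ? "retry" : "rethrow";
}

console.log(nextAction(new RetriableError("HTTP 503")));
console.log(nextAction(new NonRetriableError("HTTP 401")));
```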
The retry utility
The retry logic lives in src/lib/retry.ts and is available for any future integration point in the codebase.
withRetry(fn, options)
Wrap any async function to add automatic retry behaviour:

    import { withRetry, RetriableError, NonRetriableError } from "@/lib/retry";

    const data = await withRetry(
      async () => {
        const res = await fetch("https://api.example.com/data");
        if (res.status === 429 || res.status >= 500) {
          throw new RetriableError(`HTTP ${res.status}`);
        }
        if (!res.ok) {
          throw new NonRetriableError(`Permanent failure: HTTP ${res.status}`);
        }
        return res.json();
      },
      {
        retries: 3,            // total attempts (including first)
        baseDelayMs: 1000,     // delay before attempt 2
        backoffMultiplier: 2,
        maxDelayMs: 10_000,    // cap
        jitter: true,          // ±20% randomisation
        label: "My API",       // appears in log output
      }
    );

RetryOptions reference
| Option | Type | Default | Description |
|---|---|---|---|
| retries | number | 3 | Maximum number of attempts (including the first call) |
| baseDelayMs | number | 1000 | Delay in ms before the second attempt |
| backoffMultiplier | number | 2 | Multiplier applied to delay on each successive retry |
| maxDelayMs | number | 30000 | Upper bound on computed delay |
| jitter | boolean | true | Add ±20 % randomisation to each delay |
| isRetriable | (err) => boolean | isTransientError | Custom predicate to override transient detection |
| label | string | "operation" | Label used in console log/warn output |
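To show how these options might fit together, here is a minimal sketch of a retry loop that applies the defaults from the table. It is not the contents of src/lib/retry.ts: withRetrySketch, the Sleep type and the injectable sleep parameter are hypothetical, and the isRetriable default is a stand-in that retries everything rather than the real isTransientError.

```typescript
interface RetryOptions {
  retries?: number;
  baseDelayMs?: number;
  backoffMultiplier?: number;
  maxDelayMs?: number;
  jitter?: boolean;
  isRetriable?: (err: unknown) => boolean;
  label?: string;
}

// Assumption for this sketch only: sleep is injectable so the loop can run without real waiting.
type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetrySketch<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
  sleep: Sleep = realSleep,
): Promise<T> {
  const retries = options.retries ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 1000;
  const backoffMultiplier = options.backoffMultiplier ?? 2;
  const maxDelayMs = options.maxDelayMs ?? 30_000;
  const jitter = options.jitter ?? true;
  const isRetriable = options.isRetriable ?? (() => true); // stand-in for isTransientError
  const label = options.label ?? "operation";

  let lastError: unknown;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Permanent errors and the final attempt propagate immediately.
      if (!isRetriable(err) || attempt === retries) throw err;
      let delay = Math.min(baseDelayMs * backoffMultiplier ** (attempt - 1), maxDelayMs);
      if (jitter) delay *= 0.8 + Math.random() * 0.4; // ±20 %
      console.warn(`[${label}] attempt ${attempt} failed; retrying in ${Math.round(delay)} ms`);
      await sleep(delay);
    }
  }
  throw lastError; // only reached when retries < 1
}
```

Passing a custom isRetriable is the natural extension point when an integration signals transient failure in a way the default predicate would not recognise.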
Transient status codes
The following HTTP status codes are treated as transient by default and will trigger a retry:
| Code | Meaning |
|---|---|
| 408 | Request Timeout |
| 429 | Too Many Requests |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
| 503 | Service Unavailable |
| 504 | Gateway Timeout |
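A status check over this table could be as small as the following sketch. The names TRANSIENT_STATUS_CODES and isTransientStatus are hypothetical; the real predicate is isTransientError and its internals are not shown here.

```typescript
// The six codes from the table above.
const TRANSIENT_STATUS_CODES: ReadonlySet<number> = new Set([408, 429, 500, 502, 503, 504]);

// Hypothetical helper: true when a response status should trigger a retry by default.
function isTransientStatus(status: number): boolean {
  return TRANSIENT_STATUS_CODES.has(status);
}

console.log(isTransientStatus(503), isTransientStatus(404));
```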
Network-level errors
In addition to HTTP status codes, the following error types are automatically treated as transient:
- AbortError / TimeoutError: thrown by AbortSignal.timeout()
- TypeError containing "fetch" in the message: DNS failure, connection refused, etc.
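The two rules above translate into a short predicate. This is a sketch that assumes detection is done by error name and message, as the list describes; isTransientNetworkError is a hypothetical name, not an export of src/lib/retry.ts.

```typescript
// Hypothetical helper mirroring the two network-level rules.
function isTransientNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  // Timeouts and aborts are identified by the error's name.
  if (err.name === "AbortError" || err.name === "TimeoutError") return true;
  // fetch reports DNS failures, refused connections and similar as a TypeError.
  return err instanceof TypeError && err.message.includes("fetch");
}

const timeout = new Error("The operation timed out");
timeout.name = "TimeoutError";
console.log(isTransientNetworkError(timeout));
console.log(isTransientNetworkError(new TypeError("fetch failed")));
```

A TypeError raised by an ordinary programming mistake does not mention "fetch" in its message, so it is not retried.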
Effect on SMS alerts
The SmsResult type now includes an optional attempts field. When a sanctions match SMS required more than one attempt, the field is populated with the total attempt count. This can be used for observability or alerting on elevated retry rates.

    const result = await sendSmsAlert({ to: "+447700900000", body: "Match detected" });
    // result.attempts will be 2 if the first attempt returned 503 and the second succeeded

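As one way to use the field for observability, the sketch below computes the share of sends that needed a retry. SmsResultLike is a hypothetical stand-in that models only the attempts field (the full SmsResult type is not reproduced here), and retryRate is an illustrative helper.

```typescript
// Hypothetical stand-in for SmsResult; only the optional attempts field is modelled.
interface SmsResultLike {
  attempts?: number;
}

// Fraction of results that needed more than one attempt.
// A missing attempts field is treated as a first-try success.
function retryRate(results: SmsResultLike[]): number {
  if (results.length === 0) return 0;
  const retried = results.filter((r) => (r.attempts ?? 1) > 1).length;
  return retried / results.length;
}

const sample: SmsResultLike[] = [{}, { attempts: 2 }, { attempts: 3 }, {}];
console.log(retryRate(sample)); // 0.5
```

A sustained rise in this ratio would suggest the upstream provider is degraded even though alerts are still being delivered.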
Effect on nightly sync
The nightly OFSI sync (POST /api/sanctions/nightly) now has two retry-protected stages:
- Page scrape: the HTML page on gov.uk is fetched to discover the latest CSV URL. Transient failures are attempted up to 3 times in total before falling through to the known-URL fallback list (unchanged behaviour).
- CSV download: once a URL is resolved, the CSV itself is fetched with retry. If all 3 attempts fail with transient errors, the sync job throws and the error is recorded in the sync log as usual.
Manual re-triggering should only be necessary for genuine permanent failures (e.g. the CSV URL has moved and all fallback URLs are stale).