Features | Calmony Sanctions Monitor | Updated March 12, 2026

Exponential Backoff Retry (ERR-11)

As of v0.1.116, the platform automatically retries transient failures when communicating with external services. This means a momentary Twilio rate limit or a brief OFSI gov.uk outage will not silently drop an SMS alert or require you to manually re-trigger the nightly sync.

What is covered

| Integration | Trigger | Max attempts | Base delay | Cap |
| --- | --- | --- | --- | --- |
| Twilio SMS | HTTP 429 or 5xx | 3 | 1 s | 10 s |
| OFSI page scrape | HTTP 429/5xx or network error | 3 | 2 s | |
| OFSI CSV download | HTTP 429/5xx or network error | 3 | 2 s | 30 s |

Retry schedule

Each retry uses exponential backoff with ±20 % jitter to prevent thundering-herd effects:

Attempt 1  →  immediate
Attempt 2  →  base delay          (e.g. ~1 s for SMS)
Attempt 3  →  base × multiplier   (e.g. ~2 s for SMS)

For Twilio SMS, the total worst-case wait before giving up is approximately 3.6 seconds (1 s + 2 s, plus up to 20 % jitter on each delay), well within Vercel's 10 s edge function timeout.

For OFSI CSV downloads, each delay between attempts is capped at 30 seconds, allowing the nightly job to tolerate a briefly overloaded gov.uk CDN.
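
The schedule above can be sketched as a small function. This is an illustration under stated assumptions, not the code in src/lib/retry.ts: the name delayBeforeAttempt, the injectable random parameter, and the choice to apply the cap before the jitter are all hypothetical.

```typescript
// Hypothetical helper: how long to wait (in ms) before a given 1-based attempt.
interface BackoffOptions {
  baseDelayMs: number;
  backoffMultiplier: number;
  maxDelayMs: number;
  jitter: boolean;
}

function delayBeforeAttempt(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random, // injectable so tests can be deterministic
): number {
  if (attempt <= 1) return 0; // the first attempt runs immediately
  // Attempt 2 waits base, attempt 3 waits base × multiplier, and so on.
  const raw = opts.baseDelayMs * Math.pow(opts.backoffMultiplier, attempt - 2);
  // Assumption: the cap is applied before jitter.
  const capped = Math.min(raw, opts.maxDelayMs);
  if (!opts.jitter) return capped;
  // Scale by a factor between 0.8 and 1.2 to spread out simultaneous retries.
  const factor = 0.8 + random() * 0.4;
  return Math.round(capped * factor);
}

// Twilio SMS settings from the table above, with jitter disabled for clarity.
const smsBackoff: BackoffOptions = {
  baseDelayMs: 1000,
  backoffMultiplier: 2,
  maxDelayMs: 10_000,
  jitter: false,
};
const smsDelays = [1, 2, 3].map((n) => delayBeforeAttempt(n, smsBackoff));
// smsDelays is [0, 1000, 2000]
```

With jitter at its upper bound, the two delays become 1200 ms and 2400 ms, which is where the 3.6 s worst case comes from.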

Permanent errors are not retried

Not every failure is worth retrying. HTTP 4xx errors other than 408 and 429 (e.g. 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found) indicate a permanent problem with the request itself. These propagate immediately without consuming any retry budget, preserving fast failure for misconfiguration and credential issues.

The distinction is enforced via two explicit error classes:

  • RetriableError — the retry loop will wait and try again.
  • NonRetriableError — the retry loop rethrows immediately, no delay.
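
As a rough illustration, the two classes could look like the sketch below. The class bodies are assumptions (the real definitions in src/lib/retry.ts may carry extra fields), and errorForStatus is a hypothetical helper that mirrors the status checks used elsewhere on this page.

```typescript
// Sketch only: signals a transient failure that is worth retrying.
class RetriableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RetriableError";
  }
}

// Sketch only: signals a permanent failure that should propagate at once.
class NonRetriableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetriableError";
  }
}

// Hypothetical helper: pick the error class for a non-OK HTTP response.
// Call it only after checking that the response was not successful.
function errorForStatus(status: number): RetriableError | NonRetriableError {
  if (status === 429 || status >= 500) {
    return new RetriableError(`HTTP ${status}`);
  }
  return new NonRetriableError(`Permanent failure: HTTP ${status}`);
}
```

Keeping the decision in one place means each integration throws the right class without repeating the status logic.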

The retry utility

The retry logic lives in src/lib/retry.ts and is available for any future integration point in the codebase.

withRetry(fn, options)

Wrap any async function to add automatic retry behaviour:

import { withRetry, RetriableError, NonRetriableError } from "@/lib/retry";

const data = await withRetry(
  async () => {
    const res = await fetch("https://api.example.com/data");
    if (res.status === 429 || res.status >= 500) {
      throw new RetriableError(`HTTP ${res.status}`);
    }
    if (!res.ok) {
      throw new NonRetriableError(`Permanent failure: HTTP ${res.status}`);
    }
    return res.json();
  },
  {
    retries: 3,          // total attempts (including first)
    baseDelayMs: 1000,   // delay before attempt 2
    backoffMultiplier: 2,
    maxDelayMs: 10_000,  // cap
    jitter: true,        // ±20% randomisation
    label: "My API",     // appears in log output
  }
);

RetryOptions reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| retries | number | 3 | Maximum number of attempts (including the first call) |
| baseDelayMs | number | 1000 | Delay in ms before the second attempt |
| backoffMultiplier | number | 2 | Multiplier applied to delay on each successive retry |
| maxDelayMs | number | 30000 | Upper bound on computed delay |
| jitter | boolean | true | Add ±20 % randomisation to each delay |
| isRetriable | (err) => boolean | isTransientError | Custom predicate to override transient detection |
| label | string | "operation" | Label used in console log/warn output |
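
To make the options above concrete, here is a minimal sketch of how a loop like withRetry could combine them. It is not the implementation in src/lib/retry.ts: withRetrySketch, retryEverything, and sleep are hypothetical names, the default predicate here retries every error rather than using isTransientError, and the isRetriable predicate stands in for the NonRetriableError check.

```typescript
interface RetryOptions {
  retries?: number;
  baseDelayMs?: number;
  backoffMultiplier?: number;
  maxDelayMs?: number;
  jitter?: boolean;
  isRetriable?: (err: unknown) => boolean;
  label?: string;
}

// Stand-in for the isTransientError default: treats every error as transient.
const retryEverything = (_err: unknown): boolean => true;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetrySketch<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const {
    retries = 3,
    baseDelayMs = 1000,
    backoffMultiplier = 2,
    maxDelayMs = 30_000,
    jitter = true,
    isRetriable = retryEverything,
    label = "operation",
  } = options;

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors or once the attempt budget is spent.
      if (attempt >= retries || !isRetriable(err)) throw err;
      // The delay grows by the multiplier after each failed attempt.
      const raw = baseDelayMs * backoffMultiplier ** (attempt - 1);
      const factor = jitter ? 0.8 + Math.random() * 0.4 : 1;
      const delayMs = Math.round(Math.min(raw, maxDelayMs) * factor);
      console.warn(`[${label}] attempt ${attempt} failed, retrying in ${delayMs} ms`);
      await sleep(delayMs);
    }
  }
}
```

Because retries counts total attempts, retries: 3 means one initial call plus at most two retries.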

Transient status codes

The following HTTP status codes are treated as transient by default and will trigger a retry:

| Code | Meaning |
| --- | --- |
| 408 | Request Timeout |
| 429 | Too Many Requests |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
| 503 | Service Unavailable |
| 504 | Gateway Timeout |

Network-level errors

In addition to HTTP status codes, the following error types are automatically treated as transient:

  • AbortError / TimeoutError — thrown by AbortSignal.timeout()
  • TypeError containing "fetch" in the message — DNS failure, connection refused, etc.
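
A predicate combining the status codes and network-level errors above might look like the sketch below. It is an approximation, not the real isTransientError: the name isTransientSketch is hypothetical, and it assumes that failed HTTP calls surface as errors carrying a numeric status property.

```typescript
// Status codes from the "Transient status codes" table.
const TRANSIENT_STATUS_CODES = new Set([408, 429, 500, 502, 503, 504]);

// Hypothetical predicate: true when an error looks worth retrying.
function isTransientSketch(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { status, name, message } = err as {
    status?: unknown;
    name?: unknown;
    message?: unknown;
  };
  // Assumption: HTTP failures carry a numeric `status` on the error object.
  if (typeof status === "number") return TRANSIENT_STATUS_CODES.has(status);
  // Aborted or timed-out requests, e.g. from AbortSignal.timeout().
  if (name === "AbortError" || name === "TimeoutError") return true;
  // Network failures reported by fetch as a TypeError.
  return (
    err instanceof TypeError &&
    typeof message === "string" &&
    message.includes("fetch")
  );
}
```

A function of this shape could also be passed as the isRetriable option when an integration needs different rules.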

Effect on SMS alerts

The SmsResult type now includes an optional attempts field. When a sanctions match SMS requires more than one attempt, the field is populated with the total attempt count, which can be used for observability or for alerting on elevated retry rates.

const result = await sendSmsAlert({ to: "+447700900000", body: "Match detected" });
// result.attempts will be 2 if the first attempt returned 503 and the second succeeded
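
One way to use the attempts field for observability is to compute the share of sends that needed a retry. The sketch below is illustrative only: the SmsResultLike interface models just the optional attempts field described here, and extraAttempts and retryRate are hypothetical helpers.

```typescript
// Assumed shape: only the optional `attempts` field is documented on this page.
interface SmsResultLike {
  attempts?: number;
}

// Extra attempts beyond the first; 0 when `attempts` is absent.
function extraAttempts(result: SmsResultLike): number {
  return Math.max(0, (result.attempts ?? 1) - 1);
}

// Fraction of sends that needed at least one retry.
function retryRate(results: SmsResultLike[]): number {
  if (results.length === 0) return 0;
  const retried = results.filter((r) => extraAttempts(r) > 0).length;
  return retried / results.length;
}

// Two of these four sends needed a retry, so the rate is 0.5.
const recentSends: SmsResultLike[] = [{}, { attempts: 2 }, { attempts: 3 }, {}];
const recentRetryRate = retryRate(recentSends);
```

A rate that stays elevated over a period could be a signal to check the provider's status or rate limits.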

Effect on nightly sync

The nightly OFSI sync (POST /api/sanctions/nightly) now has two retry-protected stages:

  1. Page scrape: the HTML page on gov.uk is fetched to discover the latest CSV URL. Transient failures are retried for up to 3 attempts in total before falling through to the known-URL fallback list (unchanged behaviour).
  2. CSV download: once a URL is resolved, the CSV itself is fetched with retry. If all 3 attempts fail with transient errors, the sync job throws and the error is recorded in the sync log as usual.

Manual re-triggering should only be necessary for genuine permanent failures (e.g. the CSV URL has moved and all fallback URLs are stale).
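
The two stages can be pictured with the sketch below. It rests on several assumptions: runOfsiSyncSketch and the SyncDeps interface are hypothetical, both injected functions are taken to be wrapped in retry already, and trying each fallback URL in order is a guess, since this page does not describe how the fallback list is walked.

```typescript
interface SyncDeps {
  // Stage 1: discover the latest CSV URL (assumed to retry internally).
  scrapeCsvUrl: () => Promise<string>;
  // Stage 2: fetch the CSV body (assumed to retry internally).
  downloadCsv: (url: string) => Promise<string>;
  // Known-URL fallback list used when the scrape fails.
  fallbackUrls: string[];
}

async function runOfsiSyncSketch(deps: SyncDeps): Promise<string> {
  let candidates: string[];
  try {
    candidates = [await deps.scrapeCsvUrl()];
  } catch {
    // The scrape exhausted its attempts: fall through to the known URLs.
    candidates = deps.fallbackUrls;
  }

  let lastError: unknown = new Error("no CSV URL candidates");
  for (const url of candidates) {
    try {
      return await deps.downloadCsv(url);
    } catch (err) {
      lastError = err; // remember the failure and try the next candidate
    }
  }
  // Every candidate failed: surface the error so the sync log records it.
  throw lastError;
}
```

If the scrape succeeds, only the scraped URL is tried, so a download failure there propagates directly to the caller.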