Improving Resilience: Retry Logic for External API Calls
Release: v1.0.401 | Control: ERR-11 | Category: Resilience
The Problem
When users trigger actions that reach out to HMRC, AgentOS, or bank connection APIs — such as refreshing their registered businesses or listing properties — those calls previously had no protection against transient failures.
A single dropped connection or a momentary HMRC 503 response would immediately return an error to the user, even though a retry a fraction of a second later would likely have succeeded.
This was a gap between two parts of the platform:
- Inngest background functions had robust retry logic configured (retries: 3, with NonRetriableError used to skip retries on permanent failures).
- tRPC handlers making direct external API calls had no equivalent; they failed immediately on the first error.
The Fix
A retry utility has been added to tRPC query and mutation handlers that make direct external API calls. It implements exponential backoff with up to three retries:
Attempt 1 → fails → wait 200ms
Attempt 2 → fails → wait 400ms
Attempt 3 → fails → wait 800ms
Attempt 4 → fails → surface error to user
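The schedule above can be sketched as a small wrapper. This is a minimal illustration under stated assumptions, not the platform's actual utility: the name `withRetry`, its option names, and the injectable `sleep` are all hypothetical.

```typescript
// Hypothetical helper: names and defaults are illustrative only.
type RetryOptions = {
  retries?: number; // retries after the first attempt (3 gives 4 attempts in total)
  baseDelayMs?: number; // wait before the first retry
  shouldRetry?: (error: unknown) => boolean; // return false for permanent failures
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
};

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const {
    retries = 3,
    baseDelayMs = 200,
    shouldRetry = () => true,
    sleep = defaultSleep,
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up once retries are exhausted or the error is judged permanent.
      if (attempt >= retries || !shouldRetry(error)) throw error;
      // The wait doubles after each failed attempt: 200ms, 400ms, 800ms.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

A handler would wrap only the external call, for example `await withRetry(() => callExternalApi())`, where `callExternalApi` stands in for whatever client function the handler already uses.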
What Gets Retried
Only transient failures are retried:
- 5xx server errors (e.g. HMRC 503 Service Unavailable, 500 Internal Server Error)
- Network-level errors (e.g. connection timeouts, DNS failures, socket resets)
4xx client errors are not retried. These indicate a permanent problem with the request itself — an invalid parameter, an expired token, or a resource that does not exist. Retrying them would not help and would only add unnecessary latency.
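One way to express this rule is a predicate that a retry wrapper consults before waiting. The sketch below assumes that HTTP failures are thrown as objects carrying a numeric `status`, and that network failures carry a Node.js-style `code` string on the error itself; both conventions, and the name `isTransientError`, are assumptions for illustration. Errors that match neither shape are treated as permanent.

```typescript
// Node.js system error codes for socket resets, timeouts, and DNS failures.
const TRANSIENT_NETWORK_CODES = new Set<string>([
  "ECONNRESET",
  "ETIMEDOUT",
  "ENOTFOUND",
  "EAI_AGAIN",
]);

// Hypothetical predicate: true only for failures worth retrying.
function isTransientError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { status, code } = error as { status?: unknown; code?: unknown };

  // An HTTP response was received: retry 5xx, never 4xx (or anything else).
  if (typeof status === "number") return status >= 500 && status <= 599;

  // No HTTP status: retry only recognised network-level error codes.
  return typeof code === "string" && TRANSIENT_NETWORK_CODES.has(code);
}
```

Defaulting to "do not retry" for unrecognised errors keeps the added latency confined to cases where a second attempt has a realistic chance of succeeding.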
Affected Endpoints
| tRPC Handler | External Service | File |
|---|---|---|
| hmrc.refreshBusinesses | HMRC API | src/lib/routers/hmrc.ts |
| agentos.listProperties | AgentOS | src/lib/routers/hmrc.ts |
| bank.getConnection | Bank feed provider | src/lib/routers/hmrc.ts |
What This Means for Users
Most transient failures — the kind caused by brief network instability or a momentary API hiccup — will now be resolved automatically and transparently. Users will see a successful response rather than an error, without needing to manually refresh or retry the action themselves.
Only genuine, persistent failures (e.g. HMRC returning a 4xx because of an authentication issue, or an outage that outlasts all four attempts) will surface as errors.
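To put a number on how long an outage must last before the user sees an error, the cumulative wait can be worked out from the backoff schedule. The function name `totalBackoffMs` is hypothetical, and the figure excludes the time each request itself takes.

```typescript
// Sum of the waits between attempts for a doubling backoff schedule.
function totalBackoffMs(retries: number = 3, baseDelayMs: number = 200): number {
  let total = 0;
  for (let attempt = 0; attempt < retries; attempt++) {
    total += baseDelayMs * 2 ** attempt; // 200, then 400, then 800
  }
  return total;
}

// With the defaults: 200 + 400 + 800 = 1400 ms of waiting before the final attempt.
const worstCaseWaitMs = totalBackoffMs();
```

So a failing call adds at most about 1.4 seconds of deliberate delay on top of the four request round trips.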
Relationship to Inngest Retry Logic
This retry utility is entirely separate from Inngest's built-in retry mechanism. Inngest handles background jobs (asynchronous processing, webhook ingestion, scheduled tasks). The new retry utility handles synchronous tRPC paths — the real-time calls that happen directly in response to a user action in the UI. Both layers now have consistent resilience strategies.