All Docs › Features › Making Tax Digital · Updated March 11, 2026

How We Fixed Runaway External API Calls with AbortController (ERR-10)

Released in v1.0.394 · Category: Error Resilience


The Problem

Every production service eventually encounters a third-party API that stops responding mid-connection — not with an error, but with silence. The socket stays open, the bytes never arrive, and your serverless function sits waiting until the platform's own hard timeout fires.

That is exactly what was happening across four integration points in the platform:

  • HMRC API (src/lib/hmrc/client.ts)
  • AgentOS API (src/lib/agentos/client.ts)
  • TrueLayer bank feed API (src/lib/truelayer/client.ts)
  • Slack webhook (src/lib/slack-alert.ts)

None of these fetch() calls carried an AbortController signal or any other timeout mechanism. A single hung connection would occupy a Vercel serverless concurrency slot for anywhere from 10 to 60 seconds — the full platform timeout window — before the runtime forcibly killed the function.
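To illustrate the difference, here is a minimal sketch contrasting the unguarded shape with the simplest bounded alternative. The URL is a placeholder, not one of the platform's real endpoints, and AbortSignal.timeout() (available in Node 18+ and modern browsers) is shown as the one-line option for bounding a single call:

```typescript
// Unguarded: nothing bounds this call, so a silently hung upstream holds
// the connection open until the platform's own hard timeout kills the
// function.
async function unguardedCall(url: string): Promise<Response> {
  return fetch(url, { method: "GET" });
}

// Bounded: AbortSignal.timeout() aborts the fetch after the deadline,
// freeing the concurrency slot promptly instead of waiting on the platform.
async function boundedCall(url: string, timeoutMs: number): Promise<Response> {
  return fetch(url, { method: "GET", signal: AbortSignal.timeout(timeoutMs) });
}
```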

Why the AgentOS batch was especially risky

The AgentOS getTransactions() flow is designed for efficiency: it issues up to 24 concurrent historical-statement fetches in a single batch. That concurrency is a feature, not a bug — it dramatically reduces wall-clock time for initial account imports. But the absence of per-request timeouts meant that a single hung connection in a batch of 24 could stall all 24 slots simultaneously, cascading into broader application degradation.
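To sketch why per-request guards fix the batch case — this is not the actual AgentOS client, and fetchStatement is a hypothetical stand-in for the real per-statement call — each slot in the batch gets its own controller, so Promise.allSettled resolves as soon as every request has either finished or been aborted:

```typescript
type StatementResult = { id: number; ok: boolean };

async function fetchBatch(
  ids: number[],
  fetchStatement: (id: number, signal: AbortSignal) => Promise<StatementResult>,
  timeoutMs: number,
): Promise<PromiseSettledResult<StatementResult>[]> {
  return Promise.allSettled(
    ids.map((id) => {
      // One controller per request: aborting a stalled slot never touches
      // its siblings in the batch.
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeoutMs);
      return fetchStatement(id, controller.signal).finally(() =>
        clearTimeout(timer),
      );
    }),
  );
}
```

With allSettled, a single aborted slot surfaces as a rejected entry in the results rather than failing — or stalling — the whole batch.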


The Fix

A shared fetchWithTimeout() helper

Rather than bolt timeout logic onto each callsite individually, the fix introduces a single utility function that wraps the native fetch() API:

async function fetchWithTimeout(
  url: string,
  options: RequestInit,
  timeoutMs: number
): Promise<Response> {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(id);  // prevent timer leak if fetch resolves before timeout
  }
}

Two things are worth noting:

  1. clearTimeout in the finally block — without this, a fast-resolving request would leave a dangling timer alive in the event loop, a subtle resource leak that accumulates at scale.
  2. controller.abort() is the signal — when the timer fires, the in-flight fetch receives an AbortError. The calling code can catch and handle this as a distinct failure mode (e.g. return a 504 to the client rather than a generic 500).
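A hypothetical callsite sketch of the second point — the helper is reproduced from above for self-containment, and callHmrc with its status mapping is illustrative, not the platform's real handler:

```typescript
async function fetchWithTimeout(
  url: string,
  options: RequestInit,
  timeoutMs: number
): Promise<Response> {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(id); // prevent timer leak if fetch resolves before timeout
  }
}

// Map an abort (our timer fired) to a 504, and any other network failure
// to a 500, so callers can tell the two failure modes apart.
async function callHmrc(url: string): Promise<{ status: number; body: string }> {
  try {
    const res = await fetchWithTimeout(url, { method: "GET" }, 30_000);
    return { status: res.status, body: await res.text() };
  } catch (err) {
    const name = (err as { name?: string } | null)?.name;
    if (name === "AbortError" || name === "TimeoutError") {
      return { status: 504, body: "Upstream timed out" };
    }
    return { status: 500, body: "Upstream request failed" };
  }
}
```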

Per-service timeout budgets

Not all integrations deserve the same patience. Timeouts are sized to match the expected latency profile and the cost of a long wait for each service:

Service | Timeout | Rationale
HMRC API | 30 s | HMRC APIs are sometimes slow during peak filing periods; a generous budget avoids spurious failures on legitimate submissions.
AgentOS API | 15 s | Per-request limit inside concurrent batches; fast enough to unblock a stalled batch before it cascades.
TrueLayer API | 20 s | Bank-feed enrichment can be slow for large accounts but rarely needs more than 20 s.
Slack webhook | 5 s | Fire-and-forget alerting; if Slack is slow, we don't want to delay the primary request path.
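Centralising these budgets keeps callsites from drifting apart. A minimal sketch of that idea — the identifier names here are illustrative, not the platform's actual constants:

```typescript
// Per-service timeout budgets from the table above, in one place so every
// callsite pulls the same number.
const TIMEOUTS_MS = {
  hmrc: 30_000,      // generous: peak filing periods are slow
  agentos: 15_000,   // per-request cap inside concurrent batches
  truelayer: 20_000, // large-account enrichment can be slow
  slack: 5_000,      // fire-and-forget alerting; never delay the main path
} as const;

type Service = keyof typeof TIMEOUTS_MS;

function timeoutFor(service: Service): number {
  return TIMEOUTS_MS[service];
}
```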

Impact

  • Concurrency slot protection — Vercel serverless functions are freed promptly on third-party API failure rather than hanging until the platform hard-kills them.
  • Faster failure, better UX — Users receive a meaningful timeout error within a bounded window instead of a generic gateway timeout from the platform.
  • Cascade prevention in batch imports — The AgentOS 24-request batch is now individually guarded; a single slow upstream cannot stall the entire concurrent group.
  • Uniform coverage — All four external integration clients are now protected by the same pattern, eliminating future inconsistency as new callsites are added.

Files Changed

  • src/lib/agentos/client.ts
  • src/lib/truelayer/client.ts
  • src/lib/slack-alert.ts
  • src/lib/hmrc/client.ts

This fix was tracked as ERR-10 in the platform error-resilience backlog.