How We Fixed a Silent OOM Bomb in Agent Observability
v1.0.7 | Performance | observability.getAgentBreakdown
As SaaS Factory runs more agents across more projects, the observability layer has to keep up. In v1.0.7 we patched a subtle but serious performance anti-pattern in the getAgentBreakdown procedure — one that would have caused increasingly slow dashboards and eventual out-of-memory crashes as the platform scaled.
The Problem: Load Everything, Aggregate Nothing
The getAgentBreakdown tRPC procedure is responsible for the per-agent-type breakdown you see on the Observability dashboard: how many jobs each agent type has run, how many succeeded or failed, token usage, and average duration.
The original implementation worked like this:
- Fetch every agentJob row for all projects the caller can access, with no limit and no aggregation.
- Loop through the results in JavaScript, building up a Map<agentType, stats> on the Node.js server.
- Return the aggregated map to the client.
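A minimal sketch of what that in-process aggregation looks like. This is illustrative, not the actual code from observability.ts: the row shape and field names are assumptions, and the real procedure also handles auth and project scoping.

```typescript
// Simplified sketch of the original load-then-aggregate anti-pattern.
// Row shape is an assumption; real rows come from the agentJobs table.
interface JobRow {
  agentType: string;
  status: "completed" | "failed" | "running";
  tokenUsage: { inputTokens: number; outputTokens: number } | null;
}

interface Stats {
  totalJobs: number;
  completed: number;
  failed: number;
  inputTokens: number;
}

function aggregateInProcess(rows: JobRow[]): Map<string, Stats> {
  // Every fetched row is a live object on the Node.js heap for the
  // duration of this loop, so memory grows with total job history.
  const byType = new Map<string, Stats>();
  for (const row of rows) {
    const stats = byType.get(row.agentType) ?? {
      totalJobs: 0,
      completed: 0,
      failed: 0,
      inputTokens: 0,
    };
    stats.totalJobs++;
    if (row.status === "completed") stats.completed++;
    if (row.status === "failed") stats.failed++;
    stats.inputTokens += row.tokenUsage?.inputTokens ?? 0;
    byType.set(row.agentType, stats);
  }
  return byType;
}
```

The loop itself is cheap; the cost is everything upstream of it — the unbounded fetch and the deserialized rows it holds in memory.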
This is the classic load-then-aggregate anti-pattern. It looks harmless in development when the database has a few hundred rows. In production, with tens of thousands of agent job records across many projects, it means:
- Massive data transfer between Postgres and the application server on every dashboard load.
- Unbounded heap growth on the Node.js process — each row is a deserialized object held in memory during the loop.
- O(n) response time that degrades linearly as job history accumulates.
- A real OOM risk with no natural ceiling.
For comparison, the sibling procedure getSummary was already doing this correctly, using sql<number> aggregate filters pushed entirely into the database.
The Fix: Push Aggregation Into Postgres
The solution is a single SQL query that lets PostgreSQL do what it was built for:
```sql
SELECT
  agentType,
  COUNT(*) AS totalJobs,
  COUNT(*) FILTER (WHERE status = 'completed') AS completed,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  COALESCE(SUM((tokenUsage->>'inputTokens')::int), 0) AS inputTokens,
  COALESCE(SUM((tokenUsage->>'outputTokens')::int), 0) AS outputTokens,
  AVG(EXTRACT(EPOCH FROM (completedAt - startedAt))) AS avgDurationSec
FROM agentJobs
JOIN pipelineRuns ...
GROUP BY agentType;
```
Key techniques used:
- COUNT(*) FILTER (WHERE ...): standard SQL conditional aggregation. Computes multiple counts in a single scan instead of requiring a separate query per status.
- COALESCE(SUM(...), 0): safely handles rows where tokenUsage is null without spreading null into the result.
- AVG(EXTRACT(EPOCH FROM ...)): computes average duration in seconds entirely in the database, with no JavaScript Date arithmetic needed.
- GROUP BY agentType: returns exactly one row per distinct agent type, regardless of how many millions of underlying job rows exist.
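The FILTER clause is part of the SQL standard but not universally implemented; PostgreSQL has supported it since 9.4. On engines that lack it, the same conditional count can be expressed with CASE, since COUNT ignores nulls:

```sql
-- Equivalent conditional counts:
COUNT(*) FILTER (WHERE status = 'completed')
-- is the same as the portable form:
COUNT(CASE WHEN status = 'completed' THEN 1 END)
```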
The Result
| Metric | Before | After |
|---|---|---|
| Rows transferred | All agentJob rows (unbounded) | One row per agent type |
| Aggregation location | Node.js heap | PostgreSQL |
| Response time growth | O(n) with job count | O(1) — constant |
| OOM risk | Yes | No |
| Estimated speedup | — | 10–100× on large datasets |
The response shape returned to the client is unchanged — this is a pure internal optimization with no breaking changes to the API contract.
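One practical detail worth noting when keeping the response shape stable: Node.js Postgres drivers commonly return COUNT and SUM results as strings, because they map to Postgres bigint/numeric. A handler returning raw aggregate rows therefore typically coerces them to numbers. A hedged sketch, with illustrative field names that are not taken from observability.ts:

```typescript
// Postgres drivers often return COUNT/SUM as strings (bigint/numeric).
// This maps one aggregated row per agent type to plain numbers so the
// client-facing shape stays unchanged. Field names are assumptions.
interface RawAggRow {
  agentType: string;
  totalJobs: string;
  completed: string;
  failed: string;
  inputTokens: string;
  outputTokens: string;
  avgDurationSec: string | null; // AVG is null when no rows qualify
}

function toBreakdown(rows: RawAggRow[]) {
  return rows.map((r) => ({
    agentType: r.agentType,
    totalJobs: Number(r.totalJobs),
    completed: Number(r.completed),
    failed: Number(r.failed),
    inputTokens: Number(r.inputTokens),
    outputTokens: Number(r.outputTokens),
    avgDurationSec: r.avgDurationSec === null ? 0 : Number(r.avgDurationSec),
  }));
}
```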
Broader Lesson
This pattern shows up often in data-heavy APIs: an aggregation that works fine in testing silently degrades in production as the underlying table grows. The fix is almost always the same: move the aggregation to the data layer. An index on the grouping column can help further, for example by letting the planner group pre-sorted rows instead of sorting or hashing the full result, though the aggregates still have to read the matching rows.
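A supporting index for this kind of grouped aggregation might look like the following. The index and table names here are hypothetical, chosen only to illustrate the shape:

```sql
-- Hypothetical index with the grouping column first; including status
-- also lets the conditional counts be evaluated from the index entries.
CREATE INDEX agent_jobs_agent_type_idx
  ON agentJobs (agentType, status);
```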
We've aligned getAgentBreakdown with the approach getSummary already used. Future observability procedures will follow the same rule: aggregate in SQL, never in JavaScript.
Released in v1.0.7. File changed: src/lib/routers/observability.ts.