How We Fixed a Silent OOM Bomb in Agent Observability
v1.0.7 | Performance | observability.getAgentBreakdown
As SaaS Factory runs more agents across more projects, the observability layer has to keep up. In v1.0.7 we patched a subtle but serious performance anti-pattern in the getAgentBreakdown procedure — one that would have caused increasingly slow dashboards and eventual out-of-memory crashes as the platform scaled.
The Problem: Load Everything, Aggregate Nothing
The getAgentBreakdown tRPC procedure is responsible for the per-agent-type breakdown you see on the Observability dashboard: how many jobs each agent type has run, how many succeeded or failed, token usage, and average duration.
The original implementation worked like this:
- Fetch every agentJob row for all projects the caller can access, with no limit and no aggregation.
- Loop through the results in JavaScript, building up a Map<agentType, stats> on the Node.js server.
- Return the aggregated map to the client.
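A minimal sketch of what that in-process aggregation looks like. This is illustrative, not the actual code from observability.ts: the row shape and field names are assumptions, and the real procedure also handles auth and project scoping.

```typescript
// Simplified sketch of the original load-then-aggregate anti-pattern.
// Row shape is an assumption; real rows come from the agentJobs table.
interface JobRow {
  agentType: string;
  status: "completed" | "failed" | "running";
  tokenUsage: { inputTokens: number; outputTokens: number } | null;
}

interface Stats {
  totalJobs: number;
  completed: number;
  failed: number;
  inputTokens: number;
}

function aggregateInProcess(rows: JobRow[]): Map<string, Stats> {
  // Every fetched row is a live object on the Node.js heap for the
  // duration of this loop, so memory grows with total job history.
  const byType = new Map<string, Stats>();
  for (const row of rows) {
    const stats = byType.get(row.agentType) ?? {
      totalJobs: 0,
      completed: 0,
      failed: 0,
      inputTokens: 0,
    };
    stats.totalJobs++;
    if (row.status === "completed") stats.completed++;
    if (row.status === "failed") stats.failed++;
    stats.inputTokens += row.tokenUsage?.inputTokens ?? 0;
    byType.set(row.agentType, stats);
  }
  return byType;
}
```

The loop itself is cheap; the cost is everything upstream of it — the unbounded fetch and the deserialized rows it holds in memory.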
This is the classic load-then-aggregate anti-pattern. It looks harmless in development when the database has a few hundred rows. In production, with tens of thousands of agent job records across many projects, it means:
- Massive data transfer between Postgres and the application server on every dashboard load.
- Unbounded heap growth on the Node.js process — each row is a deserialized object held in memory during the loop.
- O(n) response time that degrades linearly as job history accumulates.
- A real OOM risk with no natural ceiling.
For comparison, the sibling procedure getSummary was already doing this correctly, using sql<number> aggregate filters pushed entirely into the database.
The Fix: Push Aggregation Into Postgres
The solution is a single SQL query that lets PostgreSQL do what it was built for:
```sql
SELECT
  agentType,
  COUNT(*) AS totalJobs,
  COUNT(*) FILTER (WHERE status = 'completed') AS completed,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed,
  COALESCE(SUM((tokenUsage->>'inputTokens')::int), 0) AS inputTokens,
  COALESCE(SUM((tokenUsage->>'outputTokens')::int), 0) AS outputTokens,
  AVG(EXTRACT(EPOCH FROM (completedAt - startedAt))) AS avgDurationSec
FROM agentJobs
JOIN pipelineRuns ...
GROUP BY agentType;
```
Key techniques used:
- COUNT(*) FILTER (WHERE ...): standard SQL conditional aggregation. Computes multiple counts in a single scan instead of requiring a separate query per status.
- COALESCE(SUM(...), 0): safely handles rows where tokenUsage is null without spreading null into the result.
- AVG(EXTRACT(EPOCH FROM ...)): computes average duration in seconds entirely in the database, with no JavaScript Date arithmetic needed.
- GROUP BY agentType: returns exactly one row per distinct agent type, regardless of how many millions of underlying job rows exist.
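The FILTER clause is part of the SQL standard but not universally implemented; PostgreSQL has supported it since 9.4. On engines that lack it, the same conditional count can be expressed with CASE, since COUNT ignores nulls:

```sql
-- Equivalent conditional counts:
COUNT(*) FILTER (WHERE status = 'completed')
-- is the same as the portable form:
COUNT(CASE WHEN status = 'completed' THEN 1 END)
```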
The Result
| Metric | Before | After |
|---|---|---|
| Rows transferred | All agentJob rows (unbounded) | One row per agent type |
| Aggregation location | Node.js heap | PostgreSQL |
| Response time growth | O(n) with job count | O(1) — constant |
| OOM risk | Yes | No |
| Estimated speedup | — | 10–100× on large datasets |
The response shape returned to the client is unchanged — this is a pure internal optimization with no breaking changes to the API contract.
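One practical detail worth noting when keeping the response shape stable: Node.js Postgres drivers commonly return COUNT and SUM results as strings, because they map to Postgres bigint/numeric. A handler returning raw aggregate rows therefore typically coerces them to numbers. A hedged sketch, with illustrative field names that are not taken from observability.ts:

```typescript
// Postgres drivers often return COUNT/SUM as strings (bigint/numeric).
// This maps one aggregated row per agent type to plain numbers so the
// client-facing shape stays unchanged. Field names are assumptions.
interface RawAggRow {
  agentType: string;
  totalJobs: string;
  completed: string;
  failed: string;
  inputTokens: string;
  outputTokens: string;
  avgDurationSec: string | null; // AVG is null when no rows qualify
}

function toBreakdown(rows: RawAggRow[]) {
  return rows.map((r) => ({
    agentType: r.agentType,
    totalJobs: Number(r.totalJobs),
    completed: Number(r.completed),
    failed: Number(r.failed),
    inputTokens: Number(r.inputTokens),
    outputTokens: Number(r.outputTokens),
    avgDurationSec: r.avgDurationSec === null ? 0 : Number(r.avgDurationSec),
  }));
}
```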
Broader Lesson
This pattern shows up often in data-heavy APIs: an aggregation that works fine in testing silently degrades in production as the underlying table grows. The fix is almost always the same: move the aggregation to the data layer. An index on the grouping column can help further, for example by letting the planner group pre-sorted rows instead of sorting or hashing the full result, though the aggregates still have to read the matching rows.
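A supporting index for this kind of grouped aggregation might look like the following. The index and table names here are hypothetical, chosen only to illustrate the shape:

```sql
-- Hypothetical index with the grouping column first; including status
-- also lets the conditional counts be evaluated from the index entries.
CREATE INDEX agent_jobs_agent_type_idx
  ON agentJobs (agentType, status);
```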
We've aligned getAgentBreakdown with the approach getSummary already used. Future observability procedures will follow the same rule: aggregate in SQL, never in JavaScript.
Released in v1.0.7. File changed: src/lib/routers/observability.ts.