All Docs
FeaturesSaaS FactoryUpdated February 19, 2026

Blog: How We Fixed a Silent OOM Bomb in Agent Observability

How We Fixed a Silent OOM Bomb in Agent Observability

v1.0.7 | Performance | observability.getAgentBreakdown


As SaaS Factory runs more agents across more projects, the observability layer has to keep up. In v1.0.7 we patched a subtle but serious performance anti-pattern in the getAgentBreakdown procedure — one that would have caused increasingly slow dashboards and eventual out-of-memory crashes as the platform scaled.

The Problem: Load Everything, Aggregate Nothing

The getAgentBreakdown tRPC procedure is responsible for the per-agent-type breakdown you see on the Observability dashboard: how many jobs each agent type has run, how many succeeded or failed, token usage, and average duration.

The original implementation worked like this:

  1. Fetch every agentJob row for all projects the caller can access — no limit, no aggregation.
  2. Loop through the results in JavaScript, building up a Map<agentType, stats> on the Node.js server.
  3. Return the aggregated map to the client.

This is the classic load-then-aggregate anti-pattern. It looks harmless in development when the database has a few hundred rows. In production, with tens of thousands of agent job records across many projects, it means:

  • Massive data transfer between Postgres and the application server on every dashboard load.
  • Unbounded heap growth on the Node.js process — each row is a deserialized object held in memory during the loop.
  • O(n) response time that degrades linearly as job history accumulates.
  • A real OOM risk with no natural ceiling.

For comparison, the sibling procedure getSummary was already doing this correctly, using sql<number> aggregate filters pushed entirely into the database.

The Fix: Push Aggregation Into Postgres

The solution is a single SQL query that lets PostgreSQL do what it was built for:

SELECT
  agentType,
  COUNT(*)                                                         AS totalJobs,
  COUNT(*) FILTER (WHERE status = 'completed')                     AS completed,
  COUNT(*) FILTER (WHERE status = 'failed')                        AS failed,
  COALESCE(SUM((tokenUsage->>'inputTokens')::int),  0)             AS inputTokens,
  COALESCE(SUM((tokenUsage->>'outputTokens')::int), 0)             AS outputTokens,
  AVG(EXTRACT(EPOCH FROM (completedAt - startedAt)))               AS avgDurationSec
FROM agentJobs
JOIN pipelineRuns ...
GROUP BY agentType;

Key techniques used:

  • COUNT(*) FILTER (WHERE ...) — Standard SQL conditional aggregation. Computes multiple counts in a single scan instead of requiring separate queries per status.
  • COALESCE(SUM(...)::int, 0) — Safely handles rows where tokenUsage is null without spreading null into the result.
  • AVG(EXTRACT(EPOCH FROM ...)) — Computes average duration in seconds entirely in the database, no JavaScript Date arithmetic needed.
  • GROUP BY agentType — Returns exactly one row per distinct agent type, regardless of how many millions of underlying job rows exist.

The Result

MetricBeforeAfter
Rows transferredAll agentJob rows (unbounded)One row per agent type
Aggregation locationNode.js heapPostgreSQL
Response time growthO(n) with job countO(1) — constant
OOM riskYesNo
Estimated speedup10–100× on large datasets

The response shape returned to the client is unchanged — this is a pure internal optimization with no breaking changes to the API contract.

Broader Lesson

This pattern shows up often in data-heavy APIs: an aggregation that works fine in testing silently degrades in production as the underlying table grows. The fix is always the same — move aggregation to the data layer. If the database already has an index on the grouping column, the query planner can satisfy the entire request without a full table scan.

We've aligned getAgentBreakdown with the approach getSummary already used. Future observability procedures will follow the same rule: aggregate in SQL, never in JavaScript.


Released in v1.0.7. File changed: src/lib/routers/observability.ts.