See Exactly Which AI Agents Are Earning Their Keep

Release v1.0.138 — Agent Performance & Quality Score Leaderboard

When you run 32+ AI agents around the clock, the natural question is: are they any good?

Until now, SaaS Factory's agent fleet was powerful but opaque. Agents shipped features, wrote code, opened PRs — but there was no unified view of which agents were performing, which were wasteful, and which were actively making things worse. The data existed in the database. It just wasn't surfaced anywhere useful.

v1.0.138 changes that with the Agent Health dashboard — a quality score leaderboard for every agent on the platform.

What You Can Now See

The dashboard pulls from three core tables — agent_jobs, pipeline_failures, and features — and computes four metrics per agent:

Features Shipped. The simplest measure of value: how many features has this agent successfully delivered to production?

Quarantine Rate. What percentage of this agent's output was flagged and quarantined before it could cause harm? A high quarantine rate is a red flag for poor output quality, and a rising one suggests that quality is degrading.

Token Cost per Feature. How expensive is this agent per unit of real output? Some agents consume a disproportionate number of tokens relative to what they ship. This metric makes that visible.

CI Failure Rate. How often does this agent's code fail continuous integration? An agent that consistently breaks CI is not saving engineering effort — it's generating rework.
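
As a rough illustration of how these four metrics could be derived, the sketch below computes them from in-memory rows. The JobRow shape and its field names (tokens_used, shipped_feature, quarantined, ci_failed) are assumptions made for this example. The real columns of agent_jobs, pipeline_failures, and features are not described in this post, and the dashboard's actual queries may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRow:
    """Hypothetical flattened view of one agent job (not the real schema)."""
    agent: str
    tokens_used: int
    shipped_feature: bool  # the job resulted in a feature delivered to production
    quarantined: bool      # the job's output was flagged and quarantined
    ci_failed: bool        # the job's code failed continuous integration


@dataclass
class AgentMetrics:
    features_shipped: int
    quarantine_rate: float                   # fraction of jobs, 0.0 to 1.0
    token_cost_per_feature: Optional[float]  # None when nothing has shipped
    ci_failure_rate: float                   # fraction of jobs, 0.0 to 1.0


def compute_metrics(rows: list[JobRow]) -> dict[str, AgentMetrics]:
    """Group job rows by agent and compute the four metrics for each agent."""
    by_agent: dict[str, list[JobRow]] = {}
    for row in rows:
        by_agent.setdefault(row.agent, []).append(row)

    result: dict[str, AgentMetrics] = {}
    for agent, jobs in by_agent.items():
        total = len(jobs)  # always at least 1, since the agent has a row
        shipped = sum(j.shipped_feature for j in jobs)
        tokens = sum(j.tokens_used for j in jobs)
        result[agent] = AgentMetrics(
            features_shipped=shipped,
            quarantine_rate=sum(j.quarantined for j in jobs) / total,
            # Avoid dividing by zero for agents that have shipped nothing.
            token_cost_per_feature=tokens / shipped if shipped else None,
            ci_failure_rate=sum(j.ci_failed for j in jobs) / total,
        )
    return result
```

Token cost per feature is left as None for an agent that has shipped nothing, so that an idle agent is not mistaken for a cheap one.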

These four signals are combined into a single composite quality score, and every agent is ranked on a leaderboard from best to worst.
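
This post does not spell out the scoring formula, so the following is only a minimal sketch of one way such a composite could work: normalise each metric to a 0 to 1 range where higher is better, take a weighted sum, and sort. The weights, the normalisation against the fleet maximum, and the names Metrics, WEIGHTS, quality_score, and leaderboard are all assumptions for illustration, not the platform's implementation.

```python
from typing import NamedTuple, Optional


class Metrics(NamedTuple):
    features_shipped: int
    quarantine_rate: float                   # 0.0 to 1.0, lower is better
    token_cost_per_feature: Optional[float]  # None when nothing has shipped
    ci_failure_rate: float                   # 0.0 to 1.0, lower is better


# Hypothetical weights chosen for illustration; they sum to 1.0.
WEIGHTS = {"shipped": 0.4, "quarantine": 0.2, "cost": 0.2, "ci": 0.2}


def quality_score(m: Metrics, max_shipped: int, max_cost: float) -> float:
    """Return a score from 0 to 100, where higher is better."""
    shipped = m.features_shipped / max_shipped if max_shipped else 0.0
    if m.token_cost_per_feature is None or max_cost <= 0:
        # No shipped features (or no cost data): worst possible cost rating.
        cost = 0.0
    else:
        # Cheaper agents score closer to 1; the most expensive agent scores 0.
        cost = 1.0 - m.token_cost_per_feature / max_cost
    total = (
        WEIGHTS["shipped"] * shipped
        + WEIGHTS["quarantine"] * (1.0 - m.quarantine_rate)
        + WEIGHTS["cost"] * cost
        + WEIGHTS["ci"] * (1.0 - m.ci_failure_rate)
    )
    return 100.0 * total


def leaderboard(fleet: dict[str, Metrics]) -> list[tuple[str, float]]:
    """Rank agents from best to worst by composite score."""
    max_shipped = max((m.features_shipped for m in fleet.values()), default=0)
    max_cost = max(
        (m.token_cost_per_feature for m in fleet.values()
         if m.token_cost_per_feature is not None),
        default=0.0,
    )
    scores = {
        name: quality_score(m, max_shipped, max_cost)
        for name, m in fleet.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because every metric in this sketch is normalised against the rest of the fleet, the resulting scores are relative: an agent's score can change when another agent is added or removed, even if its own behaviour is unchanged.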

Why This Matters

You Can Actually Tune Your Fleet

For the first time, you have a data-driven basis for deciding which agents to enable, which to disable, and which to investigate. Instead of guessing which part of the pipeline is underperforming, you can look at the leaderboard and act on it.

The Platform Gets Smarter

The quality scores aren't just for users. The platform consumes this data as a feedback signal for its own agent improvement cycles. Consistently low-scoring agents become candidates for architectural review, prompt refinement, or replacement. Measurement drives improvement.

Transparency Where Others Have None

Most autonomous coding tools are black boxes. You get output (or you don't) with no insight into what happened or why. The Agent Health dashboard is a direct rejection of that model. Every agent is accountable, and every score is derived from real execution data.

What's Next

The leaderboard is the foundation. Future releases will build on it — agent-level trend lines over time, automated alerts when an agent's score drops below a threshold, and direct integration with the feature discovery pipeline so that low-scoring agents are automatically deprioritised.

For now: open the Agent Health dashboard, see where your fleet stands, and make informed decisions.

v1.0.138 is available now. See the full changelog for details.