Issue №8 · June 30, 2026

The Ratio

A weekly newsletter on reliability economics

The Number

2 of 101

Only 2 of 101 organizations in the benchmark spend more than half their reliability budget on prevention — the rest spend the majority reacting to failures after they happen.

The benchmark tracks how reliability budgets split between prevention and reactive response. 50 organizations put 25–49% toward prevention. 26 put 10–24%. 23 sit under 10%. Only 2 of 101 spend 50% or more on prevention. That's less than 2%. The other 98% spend most of their reliability money after something has already broken.
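For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the shares from the four bucket counts quoted above. The bucket labels and variable names are ours, chosen for illustration; only the counts come from the benchmark.

```python
# Organizations per prevention-share bucket, as quoted in the paragraph above.
# The labels are illustrative; only the counts come from the benchmark.
buckets = {
    "50% or more": 2,
    "25-49%": 50,
    "10-24%": 26,
    "under 10%": 23,
}

total = sum(buckets.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


for label, count in buckets.items():
    print(f"{label:>12}: {count:3d} orgs ({share(count, total):.1f}%)")

# Share of organizations that put at least half their budget into prevention,
# and the remainder that spend the majority reacting.
prevention_first = share(buckets["50% or more"], total)
reactive_majority = 100 - prevention_first
print(f"prevention-first: {prevention_first:.2f}%, reactive majority: {reactive_majority:.2f}%")
```

The buckets sum to 101, and the prevention-first share lands just under 2%, which is where the "other 98%" figure comes from.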

The Ratio Take: Firefighting

Organize your budget around response and you will get very good at response, and nothing else. The two organizations past the 50% threshold aren't chasing perfection. They're the only ones in this dataset whose spending is pointed at preventing failure instead of surviving it. This is not a marginal gap. It is the industry default, written in budget lines. Think of it like a region where 99 of 101 fire departments spend most of their money on trucks and hoses and only 2 spend most of it on building codes. The fires keep coming, and nobody asks why. The structure of the spend is the strategy, whether anyone intended it or not.

98% of organizations spend most of their reliability budget on failures that already happened.

This Week in Reliability

AI Agents Need Production Guardrails

Multiple vendors shipped AI-driven incident response and production automation this week, but the pattern is clear: autonomous agents without production context create new risks. The reliability economics question isn't whether AI can close incidents faster—it's whether you're trading MTTR for incident frequency.
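A rough way to see the MTTR-versus-frequency trade is to treat total downtime as incident count times mean time to restore. The sketch below does exactly that. Every number in it is invented for illustration and none comes from a vendor or from the benchmark; the function names are ours.

```python
def downtime_minutes(incidents: int, mttr_minutes: float) -> float:
    """Total downtime for a period: incident count times mean time to restore."""
    return incidents * mttr_minutes


def break_even_incidents(baseline_incidents: int,
                         baseline_mttr: float,
                         new_mttr: float) -> float:
    """Incident count at which a faster MTTR yields the same total downtime."""
    return baseline_incidents * baseline_mttr / new_mttr


# Hypothetical quarter: all figures below are made up for illustration.
baseline = downtime_minutes(incidents=10, mttr_minutes=60)    # before automation
with_agent = downtime_minutes(incidents=16, mttr_minutes=40)  # faster fixes, more incidents
threshold = break_even_incidents(10, 60, 40)

print(f"baseline: {baseline:.0f} min, with agent: {with_agent:.0f} min")
print(f"break-even incident count at the faster MTTR: {threshold:.0f}")
```

In this made-up case a one-third cut in MTTR is wiped out once incidents rise from 10 to more than 15 per quarter. The model ignores severity and blast radius, so read it as a framing device, not a forecast.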

Deep Reads

Creating an agentic feedback loop with reliability guardrails

Gremlin · Core framework

Gremlin argues reliability guardrails are essential for AI development resilience, and proposes using them to create feedback loops that provide context to AI agents.

The Ratio Take: Prevention

This is the first vendor article to explicitly frame guardrails as bidirectional: they constrain what agents can break AND they teach agents what production actually tolerates. That's the shift from 'AI that closes tickets faster' to 'AI that learns why tickets happen.'

Guardrails as teacher, not just constraint.

Background Agents for Production Engineering

Resolve AI · Primary vendor example

Resolve AI launched background agents that handle ongoing production tasks like deployment monitoring, health checks, and operational reports without requiring always-on engineers. The platform is expanding to capacity management, cost analysis, and alert tuning.

The Ratio Take: Prevention

This is the reliability economics bet: shift from on-demand human attention to continuous automated observation. The ROI case depends entirely on whether these agents prevent incidents or just generate better reports during them. If they're tuning alerts before pages fire, that's prevention spend. If they're summarizing what broke, that's documentation theater.

Agents that watch production so engineers don't have to.

AI SRE: Automated Root Cause Analysis for Incident Response

Rootly · Vendor response—reactive tooling

Rootly's AI SRE is built into their incident management platform with full context across services, ownership, schedules, and incident history before investigation starts.

The Ratio Take: Firefighting

Built-in context beats bolted-on AI—this is the production-aware design pattern that matters.

Context-first AI, not chatbot-in-Slack RCA.

From Incidents to Insight: Closing the Post-Incident Review Gap

PagerDuty · Adjacent—learning loop

PagerDuty is connecting post-incident reviews back into their platform ecosystem so insights actively improve future incident detection, triage, and resolution rather than sitting in documents.

The Ratio Take: Prevention

PIRs that feed back into alerting and runbooks could close the loop—but only if the feedback actually changes system behavior, not just Slack templates.

PIR insights that change detection, not docs.

Anyshift meets Harness: production-aware approvals

Anyshift · Adjacent—pre-incident context

Anyshift now adds production impact context—affected services, owners, recent changes, and review decisions—to Harness manual approval gates before deployments wait for human sign-off.

The Ratio Take: Prevention

This is the pre-deployment guardrail: agents that know what's already broken before you deploy the next thing.

Deploy knowing what you're about to hit.

How GRAIL replaced a manual, ad hoc incident process with Rootly and cut manual effort by 80%

Rootly · Case study—guardrails in practice

GRAIL reduced manual incident effort by 80% after adopting Rootly, with the biggest win being process guardrails that enforce severity levels and required information from the start, not automation alone.

The Ratio Take: Firefighting

The guardrails force discipline—automation just makes compliance cheaper.

Process structure beats raw automation speed.

OTel and mesh-derived metrics: A 2026 reference

CNCF Blog · Infrastructure enabler

OpenTelemetry pipelines give good application visibility but miss east-west service traffic; service mesh metrics fill the gap by measuring inter-service behavior at the network layer.

The Ratio Take: The Ratio

Agent context depends on complete telemetry—this is the blind spot between what apps emit and what networks see.

You can't guard what you can't see.

The Crowd Favorite

Running Up That Hill (A Deal With God) — Kate Bush ↗ — Synchronized retries after a circuit opens spike origin traffic past the original load. Jitter is mandatory, not optional.
Ring of Fire — Johnny Cash ↗ — No tested rollback means no rollback. You discover that at 2 a.m. with revenue falling.
Money for Nothing — Dire Straits ↗ — Every manual runbook step is toil tax. Toil tax is why prevention debt compounds quarter over quarter.
Back In Black — AC/DC ↗ — Your real MTTR is whatever you clocked last game day. Skip rehearsals and that number is a guess dressed as a metric.
Eye of the Tiger — Survivor ↗ — Burn rate is the number. Remaining error budget is the distraction.

The Ratio Take: Prevention

Rehearse failure before production teaches it.
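On the retry point in the list above: one common way to keep clients from retrying in lockstep is exponential backoff with randomized delays, often called "full jitter". What follows is a minimal sketch of that idea, not a recommendation for any particular client library. The function name, the `base` and `cap` defaults, and the injectable `rng` are all assumptions made for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that failed at the same moment spread their retries out
    instead of hitting the origin together. Defaults are illustrative.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))   # the jitter: a random point under the ceiling
    return delays


# Two hypothetical clients with independent random sources get different schedules.
client_a = backoff_delays(5, rng=random.Random(1))
client_b = backoff_delays(5, rng=random.Random(2))
print([round(d, 2) for d in client_a])
print([round(d, 2) for d in client_b])
```

Without the random draw, every client would sleep for exactly the same capped exponential delay and arrive back at the origin at the same instant, which is the spike the Kate Bush entry describes.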

The Challenger — Tool of the Week

Dash0 is an OpenTelemetry-native observability platform. Its Agent0 layer investigates incidents and can generate pull requests from production telemetry. The pitch: open standards reduce lock-in, and Dash0's consumption-based pricing is easier to reason about than several separate billing meters. Caveat: Agent0 only reached GA on June 1, 2026, so treat its autonomy as early-stage. PRs still need engineering review. This is faster triage, not incident prevention. Without preventive engineering you've hired a better firefighter, not installed smoke detectors.

The Ratio Take: Firefighting

Faster triage, not autonomous resilience.

The Ratio is a weekly newsletter by Florian Hoeppner.

Take the assessment → reliabilityeconomics.com/benchmark
Reply to this email with your take.
