Issue №6 · June 16, 2026

The Ratio

A weekly newsletter on reliability economics

The Number

3 of the 4

Three of the four most common AI uses in reliability only kick in after something breaks.

The four: triage, log summarization, runbooks, and drift detection. The first three are reactive. Only drift detection prevents failures, and it's the least adopted.

The Ratio Take

AI applied only at the point of response makes failures cheaper to survive, not less likely. Painkillers for a fracture. The teams that pull ahead won't have the fastest response. They'll have the failures that never happened.

AI makes failures cheaper to survive, not less likely to happen.

This Week in Reliability

AI SRE Goes Production

Enterprise teams are moving beyond AI coding assistants to autonomous AI agents managing production infrastructure, incident response, and capacity optimization — shifting reliability spend from reactive firefighting to preventive, agent-driven operations at scale.

Deep Reads

The on-call cost of AI-generated code

Great Circle (Brent Chapman) · Primary evidence — the operational debt

AI-generated code is creating new operational burden for on-call engineers who must support systems they didn't write and often don't fully understand. The velocity gains from AI coding tools are generating downstream costs in incident response and system comprehension.

The Ratio Take

This is the dark side of the AI productivity story nobody budgets for. Every line of AI-generated code that ships without human comprehension becomes technical debt that lands on the on-call rotation. The prevention question isn't whether AI writes good code — it's whether your SREs can operate what AI builds at 3am.

AI writes fast; humans debug slow.

AI demands more engineering discipline. Not less

Charity Majors · Strategic framework — discipline as prevention

The shift to AI-generated infrastructure mirrors the earlier transition from handcrafted servers to immutable infrastructure. AI code generation requires more rigorous engineering practices, observability, and testing — not less — because the blast radius of errors grows with automation velocity.

The Ratio Take

The immutable infrastructure playbook is the blueprint for AI operations: you can't handcraft your way out of scale. AI agents shipping code demand the same discipline that made cloud-native work — CI/CD gates, observability by default, and automated rollback. Skip those and you're trading artisanal ops for industrial-scale chaos.

AI makes bad process fast, not good.

How AI SRE Works: A Three-Stage Workflow for Enterprise Infrastructure

StackGen · Vendor response — productizing AI SRE

StackGen outlines a three-stage AI SRE workflow: detection (identifying issues through observability), diagnosis (root cause via dependency graphs), and remediation (autonomous fixes with human approval gates).

The Ratio Take

This is vendor positioning for the autonomous ops category — the workflow is sound but the ROI hinges on whether diagnosis is actually autonomous or just faster ticket routing.

Three stages: detect, diagnose, remediate.
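
For readers who think in code, here is a minimal sketch of the three stages with a human approval gate on the last one. Everything in it is hypothetical and made up for illustration: the `Alert` type, the `detect`, `diagnose`, and `remediate` functions, the 5% error-rate threshold, and the toy dependency map. It is not StackGen's API or implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical dependency graph: service -> services it depends on.
DEPENDENCIES = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart": ["db"],
    "db": [],
}


@dataclass
class Alert:
    service: str
    error_rate: float


def detect(error_rates: dict, threshold: float = 0.05) -> list:
    """Stage 1: flag services whose error rate exceeds an assumed threshold."""
    return [Alert(s, r) for s, r in sorted(error_rates.items()) if r > threshold]


def diagnose(alert: Alert, unhealthy: set) -> str:
    """Stage 2: follow unhealthy dependencies down to the deepest one."""
    current = alert.service
    while True:
        bad_deps = [d for d in DEPENDENCIES.get(current, []) if d in unhealthy]
        if not bad_deps:
            return current
        current = bad_deps[0]


def remediate(root_cause: str, approve: Callable[[str], bool]) -> Optional[str]:
    """Stage 3: propose a fix, but return it only if a human approves."""
    action = f"restart {root_cause}"
    return action if approve(action) else None


def run_workflow(error_rates: dict, approve: Callable[[str], bool]) -> list:
    alerts = detect(error_rates)
    unhealthy = {a.service for a in alerts}
    actions = []
    for alert in alerts:
        action = remediate(diagnose(alert, unhealthy), approve)
        if action and action not in actions:
            actions.append(action)
    return actions


rates = {"checkout": 0.20, "payments": 0.15, "cart": 0.01, "db": 0.30}
print(run_workflow(rates, approve=lambda action: True))   # ['restart db']
print(run_workflow(rates, approve=lambda action: False))  # []
```

Three services alert, but the graph walk collapses them into one proposed action. When the approver declines, nothing happens. Whether a real product's diagnosis does more than this kind of routing is exactly the question to ask the vendor.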

Komodor Brings Autonomous AI to SRE With Reliability-First Cloud Optimization

Komodor · Vendor signal — predictive capacity

Komodor's AI-based capacity intelligence predicts resource placement and structural inefficiencies to prevent cloud waste, claiming up to 80% cost savings while maintaining SRE-level reliability standards.

The Ratio Take

Predictive placement is prevention infrastructure — if it works, this moves spend from reactive rightsizing firefights to upfront capacity modeling that stops waste before deployment.

Predict placement, prevent waste.

AI SRE Summit 2026

Cloud Native Now · Market signal — enterprise momentum

The AI SRE Summit convenes enterprise teams using AI for incident automation, toil elimination, and reliability-first cloud cost optimization in production environments.

The Ratio Take

When vendors organize a summit, the category is real — enterprise teams are already running AI SRE in prod, and the conversation has moved from 'if' to 'how much autonomy'.

Summit signals production adoption.

The 4-Body Problem of SRE: Building an Agentic OS for Autonomous Operations

StackGen · Adjacent — agent coordination challenge

StackGen frames autonomous SRE as a coordination challenge across four interacting systems (workloads, infrastructure, agents, and humans), arguing that an 'Agentic OS' layer is needed to orchestrate multi-agent operations at scale.

The Ratio Take

The 4-body metaphor is good — agent coordination is where most autonomous ops pilots fail, not agent capability. If you're running multiple AI agents in production, the missing piece isn't smarter models; it's a control plane that keeps them from stepping on each other.

Agents need orchestration, not just intelligence.
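
To make "a control plane that keeps them from stepping on each other" concrete, here is a deliberately tiny sketch: a registry that lets only one agent hold a given resource at a time. The `ControlPlane` class, the agent names, and the resource name are all hypothetical. A real system would also need timeouts, persistence, and thread safety, none of which are shown.

```python
class ControlPlane:
    """Hypothetical in-memory registry: one agent per resource at a time."""

    def __init__(self):
        self._owners = {}  # resource name -> agent currently holding it

    def acquire(self, agent: str, resource: str) -> bool:
        """Grant the resource if it is free or already held by this agent."""
        owner = self._owners.get(resource)
        if owner is None or owner == agent:
            self._owners[resource] = agent
            return True
        return False

    def release(self, agent: str, resource: str) -> None:
        """Free the resource, but only if this agent is the current holder."""
        if self._owners.get(resource) == agent:
            del self._owners[resource]


plane = ControlPlane()
first = plane.acquire("scaler-agent", "checkout-deployment")
second = plane.acquire("rollback-agent", "checkout-deployment")
plane.release("scaler-agent", "checkout-deployment")
third = plane.acquire("rollback-agent", "checkout-deployment")
print(first, second, third)  # True False True
```

The second agent is refused while the first holds the deployment, then succeeds once it is released. Neither agent needed to be smarter. They needed a referee.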

Break our demo infrastructure on purpose and watch the root cause surface

Anyshift · Adjacent — change-driven incident response

Anyshift's Playground lets users sever database connections in demo infrastructure and observe change-first root cause analysis trace incidents back to topology changes in seconds, without log searching.

The Ratio Take

Interactive demos are how new reliability categories prove value — if change-first RCA is faster than log-based triage for AI-generated infrastructure, this is the 'show don't tell' moment that moves budget.

Demo the root cause before the logs.

The Crowd Favorite

The SRE Playlist

Mr. Vain - Original Radio Edit - Culture Beat: "92% auto-resolved" is vanity if the 8% is your revenue.
No Limit - 2 Unlimited: An agent with no blast-radius limit is an outage with admin rights.
Insomnia - Radio Edit - Faithless: Escalates everything it can't classify? You kept the 3 a.m. pages.
Higher State of Consciousness (Tweekin Acid Funk) - Josh Wink: Autonomy is earned. It observes correctly before it gets to act.
Children - Robert Miles: A prod agent is a junior who's never seen an outage. Read before write.

3 A.M. and Autonomous
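
The first note on the playlist is easy to check with arithmetic. The sketch below compares an auto-resolution rate counted by incident with the same rate weighted by revenue at risk. The incident log is invented for illustration: 92 cheap incidents that were auto-resolved and 8 expensive ones that were not. The dollar amounts are assumptions, not data from any team.

```python
# Hypothetical incident log: (auto_resolved, revenue_at_risk_in_dollars).
incidents = [(True, 100)] * 92 + [(False, 50_000)] * 8


def resolution_rates(incidents):
    """Return (share of incidents auto-resolved, share of revenue auto-resolved)."""
    total = len(incidents)
    auto = sum(1 for resolved, _ in incidents if resolved)
    revenue_total = sum(rev for _, rev in incidents)
    revenue_auto = sum(rev for resolved, rev in incidents if resolved)
    return auto / total, revenue_auto / revenue_total


by_count, by_revenue = resolution_rates(incidents)
print(f"auto-resolved by count:   {by_count:.0%}")
print(f"auto-resolved by revenue: {by_revenue:.1%}")
```

Under these made-up numbers the dashboard reads 92% while the revenue-weighted figure is in the low single digits. The same log tells two very different stories depending on what you count.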

The Challenger — Industry Shift

The shift everyone's naming wrong.

The story is the same everywhere: AI drives the cost of writing code to zero. The flood is here. More software than anyone can run.

That's true. It's also not the shift.

Cheap code is the trigger. The shift is what happens next. If software multiplies, something has to operate all of it. The build side got a hundred times cheaper. The run side got nothing.

That's the cost nobody's pricing. On an agent team, tokens are 0.2% of the bill. Ninety-three percent is people.
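
To put those two shares side by side, here is a small worked example. The 0.2% and 93% figures come from the paragraph above. The $100,000 monthly bill is a hypothetical round number chosen to make the arithmetic readable, and "other" is simply the remainder, which the text does not break down.

```python
# Shares quoted in the text, in basis points (1 bp = 0.01%) to keep the math exact.
TOKENS_BP = 20      # 0.2%
PEOPLE_BP = 9_300   # 93%
OTHER_BP = 10_000 - TOKENS_BP - PEOPLE_BP  # unattributed remainder

# Hypothetical monthly bill in dollars; not a figure from the newsletter.
bill = 100_000


def share(bp: int, total: int) -> int:
    """Dollar amount for a share expressed in basis points."""
    return total * bp // 10_000


tokens = share(TOKENS_BP, bill)
people = share(PEOPLE_BP, bill)
other = share(OTHER_BP, bill)
print(tokens, people, other)  # 200 93000 6800
print(people // tokens)       # 465
```

On that assumed bill, making tokens free saves $200. The people line is 465 times larger, which is why cheaper models alone barely move the run-side cost.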

Cheap code. Expensive to keep alive.

The Ratio is a weekly newsletter by Florian Hoeppner.

Take the assessment → reliabilityeconomics.com/benchmark
Reply to this email with your take.
