The Ratio
A weekly newsletter on reliability economics
The Number
3 of the 4
Three of the four most common AI uses in reliability only kick in after something breaks.
The four most common AI uses in reliability: triage, log summarization, runbooks, drift detection. Three kick in after something breaks. Only drift detection prevents, and it's the least adopted.
The Ratio Take
AI applied only at the point of response makes failures cheaper to survive, not less likely. Painkillers for a fracture. The teams that pull ahead won't have the fastest response. They'll have the failures that never happened.
AI makes failures cheaper to survive, not less likely to happen.
This Week in Reliability
AI SRE Goes Production
Enterprise teams are moving beyond AI coding assistants to autonomous AI agents managing production infrastructure, incident response, and capacity optimization — shifting reliability spend from reactive firefighting to preventive, agent-driven operations at scale.
Deep Reads
The on-call cost of AI-generated code
Great Circle (Brent Chapman) · Primary evidence — the operational debt
AI-generated code is creating new operational burden for on-call engineers who must support systems they didn't write and often don't fully understand. The velocity gains from AI coding tools are generating downstream costs in incident response and system comprehension.
The Ratio Take
This is the dark side of the AI productivity story nobody budgets for. Every line of AI-generated code that ships without human comprehension becomes technical debt that lands on the on-call rotation. The prevention question isn't whether AI writes good code — it's whether your SREs can operate what AI builds at 3am.
AI writes fast; humans debug slow.
AI demands more engineering discipline. Not less
Charity Majors · Strategic framework — discipline as prevention
The shift to AI-generated infrastructure mirrors the earlier transition from handcrafted servers to immutable infrastructure. AI code generation requires more rigorous engineering practices, observability, and testing — not less — because the blast radius of errors grows with automation velocity.
The Ratio Take
The immutable infrastructure playbook is the blueprint for AI operations: you can't handcraft your way out of scale. AI agents shipping code demand the same discipline that made cloud-native work — CI/CD gates, observability by default, and automated rollback. Skip those and you're trading artisanal ops for industrial-scale chaos.
AI makes bad process fast, not good.
How AI SRE Works: A Three-Stage Workflow for Enterprise Infrastructure
StackGen · Vendor response — productizing AI SRE
StackGen outlines a three-stage AI SRE workflow: detection (identifying issues through observability), diagnosis (root cause via dependency graphs), and remediation (autonomous fixes with human approval gates).
This is vendor positioning for the autonomous ops category — the workflow is sound but the ROI hinges on whether diagnosis is actually autonomous or just faster ticket routing.
Three stages: detect, diagnose, remediate.
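For readers who think in code, here is a minimal sketch of what a detect, diagnose, remediate loop with a human approval gate could look like. It is not StackGen's implementation. Every name in it (Incident, detect, diagnose, remediate, the 5% error-rate threshold, the sample services) is a hypothetical stand-in, and "diagnosis" here simply follows the unhealthiest dependency until it runs out of unhealthy ones.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

THRESHOLD = 0.05  # assumed error-rate threshold, purely illustrative


@dataclass
class Incident:
    service: str
    error_rate: float
    root_cause: Optional[str] = None
    remediated: bool = False


def detect(metrics: dict[str, float]) -> list[Incident]:
    """Stage 1: flag every service whose error rate is above the threshold."""
    return [Incident(s, r) for s, r in metrics.items() if r > THRESHOLD]


def diagnose(incident: Incident, deps: dict[str, list[str]],
             metrics: dict[str, float]) -> Incident:
    """Stage 2: follow unhealthy dependencies down the graph."""
    current, seen = incident.service, {incident.service}
    while True:
        unhealthy = [d for d in deps.get(current, [])
                     if d not in seen and metrics.get(d, 0.0) > THRESHOLD]
        if not unhealthy:
            break
        # Step to the unhealthiest dependency and keep walking.
        current = max(unhealthy, key=lambda d: metrics[d])
        seen.add(current)
    incident.root_cause = current
    return incident


def remediate(incident: Incident,
              approve: Callable[[Incident], bool]) -> Incident:
    """Stage 3: act only if the approval gate says yes."""
    if approve(incident):
        incident.remediated = True  # placeholder for a real fix
    return incident


# Hypothetical sample data: error rates per service and who depends on whom.
metrics = {"checkout": 0.20, "payments": 0.35, "postgres": 0.60, "search": 0.01}
deps = {"checkout": ["payments", "search"], "payments": ["postgres"]}

# The lambda stands in for a human reviewer approving the proposed fix.
incidents = [
    remediate(diagnose(i, deps, metrics),
              approve=lambda inc: inc.root_cause == "postgres")
    for i in detect(metrics)
]
for inc in incidents:
    print(inc.service, "->", inc.root_cause, "| remediated:", inc.remediated)
```

The approval callback is the part worth arguing about: swap the lambda for a function that always returns True and you have full autonomy, swap it for one that pages a human and you have faster ticket routing.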
Komodor Brings Autonomous AI to SRE With Reliability-First Cloud Optimization
Komodor · Vendor signal — predictive capacity
Komodor's AI-based capacity intelligence predicts resource placement and structural inefficiencies to prevent cloud waste, claiming up to 80% cost savings while maintaining SRE-level reliability standards.
Predictive placement is prevention infrastructure — if it works, this moves spend from reactive rightsizing firefights to upfront capacity modeling that stops waste before deployment.
Predict placement, prevent waste.
AI SRE Summit 2026
Cloud Native Now · Market signal — enterprise momentum
The AI SRE Summit convenes enterprise teams using AI for incident automation, toil elimination, and reliability-first cloud cost optimization in production environments.
When vendors organize a summit, the category is real — enterprise teams are already running AI SRE in prod, and the conversation has moved from 'if' to 'how much autonomy'.
Summit signals production adoption.
The 4-Body Problem of SRE: Building an Agentic OS for Autonomous Operations
StackGen · Adjacent — agent coordination challenge
StackGen frames autonomous SRE as a coordination challenge across four interacting systems (workloads, infrastructure, agents, and humans), arguing that an 'Agentic OS' layer is needed to orchestrate multi-agent operations at scale.
The 4-body metaphor is good — agent coordination is where most autonomous ops pilots fail, not agent capability. If you're running multiple AI agents in production, the missing piece isn't smarter models; it's a control plane that keeps them from stepping on each other.
Agents need orchestration, not just intelligence.
Break our demo infrastructure on purpose and watch the root cause surface
Anyshift · Adjacent — change-driven incident response
Anyshift's Playground lets users sever database connections in demo infrastructure and observe change-first root cause analysis trace incidents back to topology changes in seconds, without log searching.
Interactive demos are how new reliability categories prove value — if change-first RCA is faster than log-based triage for AI-generated infrastructure, this is the 'show don't tell' moment that moves budget.
Demo the root cause before the logs.
The Crowd Favorite
The SRE Playlist
- Mr. Vain - Original Radio Edit - Culture Beat: "92% auto-resolved" is vanity if the 8% is your revenue.
- No Limit - 2 Unlimited: An agent with no blast-radius limit is an outage with admin rights.
- Insomnia - Radio Edit - Faithless: Escalates everything it can't classify? You kept the 3 a.m. pages.
- Higher State of Consciousness (Tweekin Acid Funk) - Josh Wink: Autonomy is earned. It observes correctly before it gets to act.
- Children - Robert Miles: A prod agent is a junior who's never seen an outage. Read before write.
3 A.M. and Autonomous
The Challenger — Industry Shift
The shift everyone's naming wrong.
The story is the same everywhere: AI drives the cost of writing code to zero. The flood is here. More software than anyone can run.
That's true. It's also not the shift.
Cheap code is the trigger. The shift is what happens next. If software multiplies, something has to operate all of it. The build side got a hundred times cheaper. The run side got nothing.
That's the cost nobody's pricing. On an agent team, tokens are 0.2% of the bill. Ninety-three percent is people.
Cheap code. Expensive to keep alive.
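To put those two percentages side by side, here is a small worked example. The $1,000,000 annual bill is a hypothetical round number chosen for illustration, not a figure from any benchmark. Only the 0.2% and 93% shares come from the text above, and the remainder is simply whatever is left over.

```python
# Shares from the text, in basis points (1 bp = 0.01%) so the math stays exact.
TOKEN_SHARE_BP = 20      # 0.2% of the bill
PEOPLE_SHARE_BP = 9_300  # 93% of the bill


def split_bill(total_dollars: int) -> dict[str, int]:
    """Split a total bill into tokens, people, and everything else."""
    tokens = total_dollars * TOKEN_SHARE_BP // 10_000
    people = total_dollars * PEOPLE_SHARE_BP // 10_000
    return {
        "tokens": tokens,
        "people": people,
        "other": total_dollars - tokens - people,
    }


# Hypothetical bill size, for illustration only.
bill = split_bill(1_000_000)
people_per_token_dollar = PEOPLE_SHARE_BP / TOKEN_SHARE_BP

print(bill)
print(f"People cost {people_per_token_dollar:.0f}x what tokens cost.")
```

On those shares, every dollar spent on tokens sits next to 465 dollars spent on people, whatever the size of the bill.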
The Ratio is a weekly newsletter by Florian Hoeppner.
Take the assessment → reliabilityeconomics.com/benchmark
Reply to this email with your take.