Karpathy’s March of Nines: Why 90% AI Reliability Falls Short

By alex2404

Andrej Karpathy’s “March of Nines” offers a blunt mathematical case for why enterprise AI teams building on demo-level reliability are, by his framing, nowhere near ready for production.

The core argument: a workflow that succeeds 90% of the time has merely reached the first nine. Each additional nine, moving from 90% to 99%, then 99.9%, then 99.99%, demands roughly the same engineering effort as the last. “Every single nine is the same amount of work,” Karpathy has said.

The math compounds fast. In a 10-step agentic workflow — covering intent parsing, context retrieval, planning, tool calls, validation, formatting, and audit logging — end-to-end success equals per-step reliability raised to the power of 10. A workflow where each step succeeds 90% of the time completes successfully just 34.87% of the time overall (0.9^10 ≈ 0.3487). At 10 workflows per day, that means roughly 6.5 failed runs daily. The report classifies that as “prototype territory.”

Push per-step reliability to 99% and the 10-step workflow still fails roughly once per day. At 99.9% per step, failures drop to approximately once every 10 days — still frequent enough to feel unreliable. Only at 99.99% per step does the workflow hit roughly one failure every 3.3 months, the threshold the analysis describes as “dependable enterprise-grade software.”
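
The compounding arithmetic behind these figures is easy to verify directly. A minimal sketch (the function names and defaults are illustrative, not from the article):

```python
# Sketch of the "march of nines" arithmetic: end-to-end success of an
# n-step workflow is per-step reliability raised to the n-th power.

def workflow_success(per_step: float, steps: int = 10) -> float:
    """Probability that all steps in the workflow succeed."""
    return per_step ** steps

def failures_per_day(per_step: float, steps: int = 10, runs_per_day: int = 10) -> float:
    """Expected number of failed workflows per day at a given run volume."""
    return runs_per_day * (1 - workflow_success(per_step, steps))

for r in (0.90, 0.99, 0.999, 0.9999):
    print(f"{r:.2%} per step -> {workflow_success(r):.2%} end-to-end, "
          f"{failures_per_day(r):.3f} failures/day")
```

Running this reproduces the article’s numbers: 90% per step yields ~34.87% end-to-end success and ~6.5 failures per day, while 99.99% per step yields ~0.01 failures per day, roughly one every 3.3 months.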

Turning Reliability Into Measurable Targets

The framework recommends converting reliability into service-level objectives before choosing any technical fix. Suggested indicators include workflow completion rate, tool-call success within defined timeouts, schema-valid output rate, policy compliance rate covering PII and security constraints, p95 end-to-end latency, cost per workflow, and fallback rate to safer models or human review.

Error budgets should govern experimentation, with SLO targets tiered by workflow impact — low, medium, or high.
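
Such tiered targets might be expressed as a small lookup table with an error-budget calculation. A sketch, where the tier names follow the article but the specific thresholds are illustrative assumptions:

```python
# Hypothetical SLO tiers keyed by workflow impact. Thresholds are
# illustrative placeholders, not figures from the source article.
SLOS = {
    "low":    {"completion_rate": 0.99,   "p95_latency_s": 30.0},
    "medium": {"completion_rate": 0.999,  "p95_latency_s": 15.0},
    "high":   {"completion_rate": 0.9999, "p95_latency_s": 5.0},
}

def error_budget_remaining(tier: str, total_runs: int, failures: int) -> float:
    """Fraction of the tier's error budget still unspent (negative = blown).

    The budget is the number of failures the completion-rate SLO
    tolerates over `total_runs` workflow executions.
    """
    allowed = (1 - SLOS[tier]["completion_rate"]) * total_runs
    return (allowed - failures) / allowed
```

A team might freeze risky experimentation once `error_budget_remaining` for a high-impact workflow drops near zero, which is the governance role error budgets play here.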

Nine Engineering Levers

The analysis identifies nine specific controls for adding reliability incrementally, among them:

  • Constrain autonomy with an explicit workflow graph — a state machine or DAG where each node defines allowed tools, retry limits, and success conditions, with idempotent keys enabling safe replays.
  • Enforce contracts at every boundary using JSON Schema or protobuf validation before any tool executes, with canonical IDs and normalized time (ISO-8601) and units (SI).
  • Layer validators across syntax, semantics, and business rules — schema checks catch formatting; semantic checks handle referential integrity and numeric bounds; business rules gate write actions and enforce data residency.
  • Route by risk using confidence signals, classifiers, or second-model verifiers to direct high-impact actions toward stronger models or human approval.
  • Treat tool calls as distributed systems — per-tool timeouts, backoff with jitter, circuit breakers, concurrency limits, and versioned schemas to catch silent API breakage.
  • Make retrieval observable as a versioned data product, tracking empty-retrieval rates and coverage metrics to keep responses grounded.
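
The “treat tool calls as distributed systems” lever can be illustrated with a retry helper. A minimal sketch of capped exponential backoff with full jitter (names and defaults are my own; per-tool timeouts and circuit breakers would sit alongside this in a real client):

```python
import random
import time

def call_with_backoff(fn, *, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    """Invoke a tool call, retrying transient failures.

    `fn` is any zero-argument callable representing one tool invocation.
    Delays grow exponentially (base, 2*base, 4*base, ...) up to `cap`,
    with full jitter so concurrent retries don't synchronize.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # budget exhausted; surface the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (sleeping a random amount up to the backoff ceiling) is a common choice here because it spreads retry storms out in time instead of having every caller hammer a recovering dependency at the same instant.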

The underlying point is structural. Correlated failures — authentication outages, rate limits, connector instability — tend to dominate in production unless shared dependencies are hardened independently. “When you get a demo and something works 90% of the time, that’s just the first nine,” Karpathy said. The gap between that and software users trust is not a prompt fix.


This article is a curated summary based on third-party sources.
